Article

Using Chou’s pseudo amino acid composition to predict subcellular localization of apoptosis proteins: An approach with immune genetic algorithm-based ensemble classifier


Abstract

Powerful tools for predicting apoptosis protein locations are needed because of the rapidly widening gap between the number of known structural proteins and the number of known sequences in protein databanks. In this study, based on the concept of pseudo amino acid (PseAA) composition originally introduced by Chou, a novel approximate entropy (ApEn)-based PseAA composition is proposed to represent apoptosis protein sequences. An ensemble classifier, whose basic classifier is the fuzzy K-nearest neighbor (FKNN) classifier, is introduced as the prediction engine. Each basic classifier is trained on a PseAA composition of a different dimension. Because the weight factors are crucial in the PseAA composition, the immune genetic algorithm (IGA) is used to search for the optimal weight factors when generating it. The results obtained by the jackknife test are quite encouraging, indicating that the proposed method might become a useful tool for studying protein function, or at least play a complementary role to the existing methods in the relevant areas.
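
As an illustration of the FKNN basic classifier named in the abstract, the sketch below follows the commonly cited fuzzy k-nearest-neighbor rule, in which each of the k nearest training samples votes with a weight proportional to d^(-2/(m-1)), where d is its distance to the query and m is the fuzzifier. It assumes crisp training labels. The function name, the toy feature vectors, and the default parameters are hypothetical and are not taken from the paper; the ensemble construction and the IGA weight search are not reproduced here.

```python
import math
from collections import defaultdict


def fuzzy_knn(train_X, train_y, query, k=3, m=2.0, eps=1e-12):
    """Return {label: membership} for `query` (hypothetical helper).

    Assumes crisp training labels; m > 1 is the fuzzifier.
    """
    # Euclidean distance from the query to every training sample.
    dists = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # Keep the k closest samples.
    neighbours = sorted(dists, key=lambda pair: pair[0])[:k]
    # Inverse-distance weight d^(-2/(m-1)); eps guards against d == 0.
    exponent = 2.0 / (m - 1.0)
    weights = [(1.0 / max(d, eps) ** exponent, y) for d, y in neighbours]
    total = sum(w for w, _ in weights)
    # Normalised weights per class are the fuzzy memberships (they sum to 1).
    memberships = defaultdict(float)
    for w, y in weights:
        memberships[y] += w / total
    return dict(memberships)


# Toy two-dimensional feature vectors standing in for PseAA compositions.
demo_X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
demo_y = ["cytoplasmic", "cytoplasmic", "nuclear", "nuclear"]
demo_memberships = fuzzy_knn(demo_X, demo_y, (0.2, 0.1), k=3)
print(demo_memberships)
```

The predicted location is the label with the largest membership. Unlike a plain k-NN vote, the membership values also indicate how confident the assignment is.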


... Computational approaches, which are developed through machine learning techniques, comprise the informative feature vector and suitable prediction algorithms [7]. A variety of classifiers have been employed to develop protein function prediction model, which includes decision tree (C4.5), k-nearest neighbor (k-NN), multi-layer perceptron (MLP), Naïve Bayes (NB), support vector machine (SVM), and ensemble classifiers such as AdaBoost, gradient boosting machine (GBM), and random forest (RF) [8]. From the studies, it has been observed that considerable improvements in the performance of prediction model are achieved through augmented features which includes sequence, physicochemical, and evolutionary information of the protein sequence as compared to single feature [12]-[13]. ...
Article
Advances in high-throughput techniques have produced a large number of unknown protein sequences (UPS). Functional characterization of UPS is significant for the investigation of disease symptoms and drug repositioning. Protein subcellular localization is imperative for the functional characterization of protein sequences. Diverse techniques are used on protein sequences for feature extraction. However, a single feature extraction technique often leads to poor prediction performance. In this paper, two feature augmentations are described that combine sequence-induced, physicochemical, and evolutionary information of the amino acid residues; the augmented features preserve the sequence-order information and the protein-residue properties. Two bacterial protein datasets, Gram-Positive (G+) and Gram-Negative (G-), are utilized for the experimental work. After performing essential preprocessing on the protein datasets, two sets of feature vectors are obtained. These feature vectors are used separately to train different individual and ensemble classifiers, namely decision tree (C4.5), k-nearest neighbor (k-NN), multi-layer perceptron (MLP), Naïve Bayes (NB), support vector machine (SVM), AdaBoost, gradient boosting machine (GBM), and random forest (RF), with fivefold cross-validation. Prediction results of the model demonstrate that the overall accuracy reported by C4.5 is the highest, at 99.57% on the G+ dataset and 97.47% on the G- dataset with known protein sequences. Similarly, for the UPS, the overall accuracy is 85.17% on the G+ dataset with SVM and 82.45% on the G- dataset with MLP.
... Chen and Li [39] constructed a dataset containing 317 apoptosis protein sequences and obtained higher prediction accuracy, which combined support vector machine and increment of diversity (named as ID_SVM) by using jackknife test. Similarly, Ding et al. [40] used the Fuzzy K-nearest neighbor (FKNN) algorithm and the overall prediction accuracy was 90.9% using CL317 dataset. Qiu et al. [41] used the DWT_SVM method to obtain high prediction accuracy rates of 97.5, 87.6 and 88.8% for CL317, ZW225 and ZD98 datasets, respectively by jackknife test. ...
... As can be seen from Table 10, the OA of CL317 dataset is 99.7% by using PsePSSM-DCCA-LFDA, which is 2.2-17% higher than other prediction methods. We can find that the overall accuracy by our method is higher than that of ID [93], ID_SVM [39], DF_SVM [21], FKNN [40] and so on. The value of sensitivity for each protein class is listed. ...
... As can be seen from Table 11, the OA of ZW225 dataset is 99.6% using PsePSSM-DCCA-LFDA, which is almost 16.5, 15.6, 13.8 and 12.5% higher than EBGW_SVM [15], DF_SVM [21], FKNN [40], Auto_Cova [42], respectively. Especially for the most difficult case-mitochondrial proteins, the predictive accuracy has improved to 100% by our method, which is 40% higher than that of the EBGW_SVM [15], 36% higher than the prediction accuracy of DF_SVM [21]. ...
Article
Full-text available
Background: Apoptosis is associated with some human diseases, including cancer, autoimmune disease, neurodegenerative disease and ischemic damage, etc. Apoptosis proteins subcellular localization information is very important for understanding the mechanism of programmed cell death and the development of drugs. Therefore, the prediction of subcellular localization of apoptosis protein is still a challenging task. Results: In this paper, we propose a novel method for predicting apoptosis protein subcellular localization, called PsePSSM-DCCA-LFDA. Firstly, the protein sequences are extracted by combining pseudo-position specific scoring matrix (PsePSSM) and detrended cross-correlation analysis coefficient (DCCA coefficient), then the extracted feature information is reduced dimensionality by LFDA (local Fisher discriminant analysis). Finally, the optimal feature vectors are input to the SVM classifier to predict subcellular location of the apoptosis proteins. Overall prediction accuracies of 99.7, 99.6 and 100% are achieved respectively on the three benchmark datasets by the most rigorous jackknife test, which is better than other state-of-the-art methods. Conclusion: The experimental results indicate that our method can significantly improve the prediction accuracy of subcellular localization of apoptosis proteins, which is quite high to be able to become a promising tool for further proteomics studies. The source code and all datasets are available at https://github.com/QUST-BSBRC/PsePSSM-DCCA-LFDA/.
... These methods were developed based on; (1) the design of the protein encoding scheme of the feature extraction; (2) the selection of the classifier [7]. Some sequence features are used for the first task, e.g., amino acid composition [8], dipeptide composition, which represents the composition of amino acid pairs and gapped amino acid pairs [9], pseudo amino acid composition [10][11][12][13][14], Markov chains [15], wavelet coefficients [3], distance frequency [16], grouped weight encoding [2], PSSMs [7,17,18], and gene ontology [19,20]. For example, the Markov chains, being a discrete stochastic model [21], contain the frequencies of 20 native amino acids and the information of amino acid pairs in protein sequences, which reflect the composition and local amino acid order of the protein sequences. ...
... The PSSM reflects the evolutionary information of a protein sequence, and has been used for the prediction of protein function [23], subcellular location [5], and structural class [24,25]. In addition, a few machine learning algorithms have been developed for the second task, including the fuzzy k-nearest neighbor algorithm [12], SVM [3,7,[16][17][18], covariant discrimination algorithm [9], and ensemble classifier [26,27]. Among these, the SVM proposed by Vapnik [28] exhibited the most promising results [7]. ...
... The ACC transformation method was developed by Wold et al. [30], and has been widely used in protein family classification and protein interaction prediction [31,32]. Although computational methods, such as PSSM-trigram [7] and FKNN (fast k-nearest neighbor algorithm) [12], have been reported to reliably identify the subcellular location of APs, there is still room for improvement of the prediction accuracy. In our previous research, we established highly accurate protein structural class prediction methods based on the PSSMs using the SVM classifier [25,32]. ...
Article
Full-text available
Apoptosis proteins (APs) control normal tissue homeostasis by regulating the balance between cell proliferation and death. The function of APs is strongly related to their subcellular location. To date, computational methods have been reported that reliably identify the subcellular location of APs, however, there is still room for improvement of the prediction accuracy. In this study, we developed a novel method named iAPSL-IF (identification of apoptosis protein subcellular location—integrative features), which is based on integrative features captured from Markov chains, physicochemical property matrices, and position-specific score matrices (PSSMs) of amino acid sequences. The matrices with different lengths were transformed into fixed-length feature vectors using an auto cross-covariance (ACC) method. An optimal subset of the features was chosen using a recursive feature elimination (RFE) algorithm method, and the sequences with these features were trained by a support vector machine (SVM) classifier. Based on three datasets ZD98, CL317, and ZW225, the iAPSL-IF was examined using a jackknife cross-validation test. The resulting data showed that the iAPSL-IF outperformed the known predictors reported in the literature: its overall accuracy on the three datasets was 98.98% (ZD98), 94.95% (CL317), and 97.33% (ZW225), respectively; the Matthews correlation coefficient, sensitivity, and specificity for several classes of subcellular location proteins (e.g., membrane proteins, cytoplasmic proteins, endoplasmic reticulum proteins, nuclear proteins, and secreted proteins) in the datasets were 0.92–1.0, 94.23–100%, and 97.07–100%, respectively. Overall, the results of this study provide a high throughput and sequence-based method for better identification of the subcellular location of APs, and facilitates further understanding of programmed cell death in organisms.
... When we study methods of protein fold recognition, we found that less attention has been paid to the fusion of features to get more comprehensive features. In recent studies, researchers attempted to find new feature extraction methods [5,6,7,8,9] or train different classifiers to achieve high accuracy [10,11,12,13], even though some problems like incomplete data sources, false positive information, multiple aspect problem, ... encourage us to combine data sources. ...
... Parameters were used as inputs of the artificial neural networks [19]. The composition entropy was proposed to represent apoptosis protein sequences, and an ensemble classifier FKNN (fuzzy K-nearest neighbor) was used as a predictor [13]. ...
Preprint
Full-text available
Protein fold recognition plays a crucial role in discovering three-dimensional structure of proteins and protein functions. Several approaches have been employed for the prediction of protein folds. Some of these approaches are based on extracting features from protein sequences and using a strong classifier. Feature extraction techniques generally utilize syntactical-based information, evolutionary-based information and physiochemical-based information to extract features. In recent years, finding an efficient technique for integrating discriminate features have been received advancing attention. In this study, we integrate Auto-Cross-Covariance (ACC) and Separated dimer (SD) evolutionary feature extraction methods. The results features are scored by Information gain (IG) to define and select several discriminated features. According to three benchmark datasets, DD, RDD and EDD, the results of the support vector machine (SVM) show more than 6% improvement in accuracy on these benchmark datasets.
... The Chou's PseAAC based-methods achieved about an increase of 20 percent of predicting accuracy than amino acids composition-based methods; (3) the hybrid methods allowing for integrating features from multiple views, which usually increase prediction accuracy [8][9][10]. After the sequence feature was constructed, various classifiers including covariant discriminant (CDC) [10,11], nearest neighbor (NN) [12,13], support vector machine (SVM) [14], deep learning [15] and ensemble classifier [16,17] were adopted to predict protein subcellular localization. ...
... Chen et al. utilized the measure of diversity and increment of diversity on protein primary sequences [18]. Ding et al. represented the apoptosis protein sequences by a novel approximate entropy (ApEn)-based PseAAC and employed an ensemble classifier model as the prediction engine, of which the basic classifier is the fuzzy K-nearest neighbor [16]. Lin et al. refined the PseAAC based on the physico-chemical characteristics of the 20 amino acids, and adopted SVM to predict protein subcellular locations [19]. ...
Article
Full-text available
The prediction of protein subcellular localization is critical for inferring protein functions, gene regulations and protein-protein interactions. With the advances of high-throughput sequencing technologies and proteomic methods, the protein sequences of numerous yeasts have become publicly available, which enables us to computationally predict yeast protein subcellular localization. However, widely-used protein sequence representation techniques, such as amino acid composition and the Chou’s pseudo amino acid composition (PseAAC), are difficult in extracting adequate information about the interactions between residues and position distribution of each residue. Therefore, it is still urgent to develop novel sequence representations. In this study, we have presented two novel protein sequence representation techniques including Generalized Chaos Game Representation (GCGR) based on the frequency and distributions of the residues in the protein primary sequence, and novel statistics and information theory (NSI) reflecting local position information of the sequence. In the GCGR + NSI representation, a protein primary sequence is simply represented by a 5-dimensional feature vector, while other popular methods like PseAAC and dipeptide adopt features of more than hundreds of dimensions. In practice, the feature representation is highly efficient in predicting protein subcellular localization. Even without using machine learning-based classifiers, a simple model based on the feature vector can achieve prediction accuracies of 0.8825 and 0.7736 respectively for the CL317 and ZW225 datasets. To further evaluate the effectiveness of the proposed encoding schemes, we introduce a multi-view features-based method to combine the two above-mentioned features with other well-known features including PseAAC and dipeptide composition, and use support vector machine as the classifier to predict protein subcellular localization. 
This novel model achieves prediction accuracies of 0.927 and 0.871 respectively for the CL317 and ZW225 datasets, better than other existing methods in the jackknife tests. The results suggest that the GCGR and NSI features are useful complements to popular protein sequence representations in predicting yeast protein subcellular localization. Finally, we validate a few newly predicted protein subcellular localizations by evidences from some published articles in authority journals and books.
... The CL317 dataset is the latest and largest existing dataset, which includes 112 cytoplasmic proteins, 55 membrane proteins, 34 mitochondrial proteins, 17 secreted proteins, 52 nuclear proteins, and 47 endoplasmic reticulum proteins. In the second step, many methods have been used to extract core and essential features of the apoptosis protein samples, such as amino acid composition [8], pseudo-amino-acid composition [6,7,[9][10][11][12], group weight coding [5], distance frequency [13], autocovariance transformation based on position-specific score matrix (PSSM-AC) [14], and Gene Ontology (GO) annotation information [15]. In the last step, some common machine learning algorithms, for example, support vector machine (SVM) [13,14,16], fuzzy k-nearest neighbor (FKNN) [9,10], and ensemble learning [17,18], have been used to perform the prediction. ...
Article
Full-text available
Apoptosis proteins play an important role in the mechanism of programmed cell death. Predicting subcellular localization of apoptosis proteins is an essential step to understand their functions and identify drugs target. Many computational prediction methods have been developed for apoptosis protein subcellular localization. However, these existing works only focus on the proteins that have one location; proteins with multiple locations are either not considered or assumed as not existing when constructing prediction models, so that they cannot completely predict all the locations of the apoptosis proteins with multiple locations. To address this problem, this paper proposes a novel multilabel predictor named MultiP-Apo, which can predict not only apoptosis proteins with single subcellular location but also those with multiple subcellular locations. Specifically, given a query protein, GO-based feature extraction method is used to extract its feature vector. Subsequently, the GO feature vector is classified by a new multilabel classifier based on the label-specific features. It is the first multilabel predictor ever established for identifying subcellular locations of multilocation apoptosis proteins. As an initial study, MultiP-Apo achieves an overall accuracy of 58.49% by jackknife test, which indicates that our proposed predictor may become a very useful high-throughput tool in this area.
... The Pseudo-amino acid composition (PAAC) is a descriptor for peptide sequences that captures both the composition and the order of amino acids (Ding & Zhang, 2008). This descriptor extends the traditional amino acid composition by incorporating the order information through two key parameters: the sequence order correlation factor λ and a weighting factor ω. ...
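
To make the roles of λ and ω concrete, the following is a minimal sketch of a type-1 pseudo amino acid composition: 20 normalised residue frequencies followed by λ sequence-order correlation factors scaled by the weight ω. As an assumption for brevity, it uses a single physicochemical property (the Kyte-Doolittle hydropathy index) as a stand-in for whichever property set a given study uses. The function names and default values are hypothetical, and the sequence is assumed to contain only the 20 standard residues.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index, used here as a single stand-in property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(prop):
    """Rescale a property table to zero mean and unit standard deviation."""
    values = list(prop.values())
    mu, sd = mean(values), pstdev(values)
    return {aa: (v - mu) / sd for aa, v in prop.items()}


def pseaac(seq, lam=3, w=0.05, prop=KYTE_DOOLITTLE):
    """Return a (20 + lam)-dimensional pseudo amino acid composition."""
    if lam >= len(seq):
        raise ValueError("lam must be smaller than the sequence length")
    h = standardize(prop)
    # Conventional amino acid composition (20 frequencies).
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    # k-th tier correlation factor: mean squared property difference
    # between residues that are k positions apart.
    thetas = [
        mean((h[seq[i]] - h[seq[i + k]]) ** 2 for i in range(len(seq) - k))
        for k in range(1, lam + 1)
    ]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


demo_vector = pseaac("MKVLAAGIVGLLLAQ", lam=3, w=0.05)
print(len(demo_vector))
```

A larger λ captures longer-range order effects at the cost of more dimensions, while ω controls how strongly the order terms weigh against the plain composition.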
Article
Full-text available
The emergence and spread of antibiotic‐resistant bacteria pose a significant public health threat, necessitating the exploration of alternative antibacterial strategies. Antibacterial peptide (ABP) is a kind of antimicrobial peptide (AMP) that has the potential ability to fight against bacteria infection, offering a promising avenue for developing novel therapeutic interventions. This study introduces AMPActiPred, a three‐stage computational framework designed to identify ABPs, characterize their activity against diverse bacterial species, and predict their activity levels. AMPActiPred employed multiple effective peptide descriptors to effectively capture the compositional features and physicochemical properties of peptides. AMPActiPred utilized deep forest architecture, a cascading architecture similar to deep neural networks, capable of effectively processing and exploring original features to enhance predictive performance. In the first stage, AMPActiPred focuses on ABP identification, achieving an Accuracy of 87.6% and an MCC of 0.742 on an elaborate dataset, demonstrating state‐of‐the‐art performance. In the second stage, AMPActiPred achieved an average GMean at 82.8% in identifying ABPs targeting 10 bacterial species, indicating AMPActiPred can achieve balanced predictions regarding the functional activity of ABP across this set of species. In the third stage, AMPActiPred demonstrates robust predictive capabilities for ABP activity levels with an average PCC of 0.722. Furthermore, AMPActiPred exhibits excellent interpretability, elucidating crucial features associated with antibacterial activity. AMPActiPred is the first computational framework capable of predicting targets and activity levels of ABPs. Finally, to facilitate the utilization of AMPActiPred, we have established a user‐friendly web interface deployed at https://awi.cuhk.edu.cn/∼AMPActiPred/.
... AAC can only provide composition information of the protein sequence, whereas pseudo-amino acid composition (PAAC) incorporates sequence order effects and frequencies of 20 amino acids in a composite encoding [45]. ...
Article
Full-text available
One of the major challenges in cancer therapy lies in the limited targeting specificity exhibited by existing anti-cancer drugs. Tumor-homing peptides (THPs) have emerged as a promising solution to this issue, due to their capability to specifically bind to and accumulate in tumor tissues while minimally impacting healthy tissues. THPs are short oligopeptides that offer a superior biological safety profile, with minimal antigenicity, and faster incorporation rates into target cells/tissues. However, identifying THPs experimentally, using methods such as phage display or in vivo screening, is a complex, time-consuming task, hence the need for computational methods. In this study, we proposed StackTHPred, a novel machine learning-based framework that predicts THPs using optimal features and a stacking architecture. With an effective feature selection algorithm and three tree-based machine learning algorithms, StackTHPred has demonstrated advanced performance, surpassing existing THP prediction methods. It achieved an accuracy of 0.915 and a 0.831 Matthews Correlation Coefficient (MCC) score on the main dataset, and an accuracy of 0.883 and a 0.767 MCC score on the small dataset. StackTHPred also offers favorable interpretability, enabling researchers to better understand the intrinsic characteristics of THPs. Overall, StackTHPred is beneficial for both the exploration and identification of THPs and facilitates the development of innovative cancer therapies.
... PAAC is an effective peptide descriptor to represent amino acid sequences [35,36]. The AAC, DPC, CKSAAGP descriptors can provide representative information about the sequence-based characterization and motifs by calculating the occurrence of different patterns within the peptide but no sequence order information. ...
Article
Full-text available
Antiviral peptide (AVP) is a kind of antimicrobial peptide (AMP) that has the potential ability to fight against virus infection. Machine learning-based prediction with a computational biology approach can facilitate the development of the novel therapeutic agents. In this study, we proposed a double-stage classification scheme, named AVPIden, for predicting the AVPs and their functional activities against different viruses. The first stage is to distinguish the AVP from a broad-spectrum peptide collection, including not only the regular peptides (non-AMP) but also the AMPs without antiviral functions (non-AVP). The second stage is responsible for characterizing one or more virus families or species that the AVP targets. Imbalanced learning is utilized to improve the performance of prediction. The AVPIden uses multiple descriptors to precisely demonstrate the peptide properties and adopts explainable machine learning strategies based on Shapley value to exploit how the descriptors impact the antiviral activities. Finally, the evaluation performance of the proposed model suggests its ability to predict the antivirus activities and their potential functions against six virus families (Coronaviridae, Retroviridae, Herpesviridae, Paramyxoviridae, Orthomyxoviridae, Flaviviridae) and eight kinds of virus (FIV, HCV, HIV, HPIV3, HSV1, INFVA, RSV, SARS-CoV). The AVPIden gives an option for reinforcing the development of AVPs with the computer-aided method and has been deployed at http://awi.cuhk.edu.cn/AVPIden/.
... PAAC [10] is claimed as an effective peptide descriptor for resolving many proteins/amino acid sequences related problems [15,35,42,50]. The regular AAC or DiC barely consider the sequenceorder information. ...
Article
Full-text available
As the current worldwide outbreaks of the SARS-CoV-2, it is urgently needed to develop effective therapeutic agents for inhibiting the pathogens or treating the related diseases. Antimicrobial peptides (AMP) with functional activity against coronavirus could be a considerable solution, yet there is no research for identifying anti-coronavirus (anti-CoV) peptides with the computational approach. In this study, we first investigated the physiochemical and compositional properties of the collected anti-CoV peptides by comparing against three other negative sets: antivirus peptides without anti-CoV function (antivirus), regular AMP without antivirus functions (non-AVP) and peptides without antimicrobial functions (non-AMP). Then, we established classifiers for identifying anti-CoV peptides between different negative sets based on random forest. Imbalanced learning strategies were adopted due to the severe class-imbalance within the datasets. The geometric mean of the sensitivity and specificity (GMean) under the identification from antivirus, non-AVP and non-AMP reaches 83.07%, 85.51% and 98.82%, respectively. Then, to pursue identifying anti-CoV peptides from broad-spectrum peptides, we designed a double-stages classifier based on the collected datasets. In the first stage, the classifier characterizes AMPs from regular peptides. It achieves an area under the receiver operating curve (AUCROC) value of 97.31%. The second stage is to identify the anti-CoV peptides between the combined negatives of other AMPs. Here, the GMean of evaluation on the independent test set is 79.42%. The proposed approach is considered as an applicable scheme for assisting the development of novel anti-CoV peptides. The datasets and source codes used in this study are available at https://github.com/poncey/PreAntiCoV.
... Parameters were used as inputs of the artificial neural networks 28 . The composition entropy was proposed to represent apoptosis protein sequences, and an ensemble classifier FKNN (fuzzy K-nearest neighbor) was used as a predictor 16 . TAXFOLD 29 method extracted sequence evolution features from PSI-BLAST profiles and also the secondary structure features from PSIPRED profiles, finally a set of 137 features is constructed to predict protein folds. ...
Article
Full-text available
Protein fold recognition plays a crucial role in discovering three-dimensional structure of proteins and protein functions. Several approaches have been employed for the prediction of protein folds. Some of these approaches are based on extracting features from protein sequences and using a strong classifier. Feature extraction techniques generally utilize syntactical-based information, evolutionary-based information and physicochemical-based information to extract features. In recent years, finding an efficient technique for integrating discriminate features have been received advancing attention. In this study, we integrate Auto-Cross-Covariance and Separated dimer evolutionary feature extraction methods. The results’ features are scored by Information gain to define and select several discriminated features. According to three benchmark datasets, DD, RDD, and EDD, the results of the support vector machine show more than 6% improvement in accuracy on these benchmark datasets.
... In 2003, Zhou and Doctor [9] firstly put forward subcellular location of apoptosis proteins. Based on their research, many prediction algorithms are proposed one after another, including PseAAC with FKNN, PseAAC with SVM, distance frequency with SVM, covariance transformation, deep learning, fusion methods [9][10][11][12][13][14][15][16][17][18][19][20][21][22]. And GO annotation [23,24], discrete wavelet transform [25,26] and other methods were also introduced. ...
Article
Full-text available
Background: Subcellular localization prediction of protein is an important component of bioinformatics, which has great importance for drug design and other applications. A multitude of computational tools for proteins subcellular location have been developed in the recent decades, however, existing methods differ in the protein sequence representation techniques and classification algorithms adopted. Results: In this paper, we firstly introduce two kinds of protein sequences encoding schemes: dipeptide information with space and Gapped k-mer information. Then, the Gapped k-mer calculation method which is based on quad-tree is also introduced. Conclusions: From the prediction results, this method not only reduces the dimension, but also improves the prediction precision of protein subcellular localization.
... It is distributed in the brain and shows pervasive inhibition of neurons. Multiple studies suggest that in some diseases such as cerebral ischemic injury and HIV-associated dementia, cellular apoptosis is closely related to changes in the concentrations of amino acid neurotransmitters [17][18][19][20][21]. ...
Article
Autism spectrum disorder (ASD) is classified as a neurodevelopmental disorder characterized by reduced social communication as well as repetitive behaviors. Many studies have proved that defective synapses in ASD influence how neurons in the brain connect and communicate with each other. Synaptopathies arise from alterations affecting the integrity and/or functionality of synapses and can contribute to synaptic pathologies. This study investigated the plasma levels of GABA, an inhibitory neurotransmitter, and of caspases 3 and 9, which are pro-apoptotic proteins, in 20 ASD children and 20 neurotypical controls using the ELISA technique. Receiver-operating characteristic (ROC) analysis of the data was performed to evaluate the diagnostic value of these biomarkers. Pearson’s correlations and multiple regressions between the measured variables were also computed. While the GABA level was reduced in ASD patients, levels of caspases 3 and 9 were significantly higher than in neurotypical control participants. ROC and predictiveness curves showed that caspase 3, caspase 9, and GABA might be utilized as predictive markers in autism diagnosis. The present study indicates that the presence of GABAergic dysfunction promotes apoptosis in Egyptian ASD children. The observed GABA synaptopathies and their connection with apoptosis can both relate to neuronal excitation and imbalance of the inhibition system, and can be used as reliable predictive biomarkers for ASD.
... These methods focus mainly on two aspects: (1) the construction of protein sequence encoding schemes and feature extraction; and (2) the design of a classification algorithm. There exist multiple machine learning techniques to estimate protein subcellular positions, such as covariant discriminant [2], fuzzy k-nearest neighbor [3,4], support vector machine (SVM) [5][6][7][8], and ensemble classifier [9,10]. Among these, SVM is widely used for its robust prediction performance. ...
Article
To reveal the working pattern of programmed cell death, knowledge of the subcellular location of apoptosis proteins is essential. Besides the costly and time-consuming method of experimental determination, research into computational locating schemes, focusing mainly on the innovation of representation techniques on protein sequences and the selection of classification algorithms, has become popular in recent decades. In this study, a novel tri-gram encoding model is proposed, which is based on using the protein overlapping property matrix (POPM) for predicting apoptosis protein subcellular location. Next, a 1000-dimensional feature vector is built to represent a protein. Finally, with the help of support vector machine-recursive feature elimination (SVM-RFE), we select the optimal features and put them into a support vector machine (SVM) classifier for predictions. The results of jackknife tests on two benchmark datasets demonstrate that our proposed method can achieve satisfactory prediction performance level with less computing capacity required and could work as a promising tool to predict the subcellular locations of apoptosis proteins.
... In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: independent dataset test, sub-sampling test, and jackknife test [82]. However, as elucidated by [83] and demonstrated in [84], among the three cross-validation methods, the jackknife test is deemed the most objective that can always yield a unique result for a given benchmark dataset, and hence has been increasingly and widely used by investigators to examine the accuracy of various predictors (see, e.g., [85][86][87][88][89][90][91][92][93][94][95]). In the current study, because the jackknife test would take a lot of computational time, we choose to use the independent dataset test to examine the prediction accuracy. ...
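The jackknife test mentioned above leaves out each sample in turn, trains on the remaining ones, and tests on the held-out sample, so a given dataset always yields the same result. A minimal Python sketch follows; the 1-nearest-neighbor predictor and the toy two-dimensional data are illustrative assumptions, not the classifier or dataset of any cited study.

```python
def nearest_neighbor_predict(train, query):
    """Predict the label of `query` as the label of its closest training sample."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(train, key=lambda item: sq_dist(item[0], query))
    return label


def jackknife_accuracy(samples, predict=nearest_neighbor_predict):
    """Leave each sample out in turn, train on the rest, test on the held-out one."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # every sample except the i-th
        if predict(train, features) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical toy data: (feature vector, location label).
toy = [
    ((0.0, 0.0), "cyto"), ((0.1, 0.0), "cyto"), ((0.0, 0.2), "cyto"),
    ((1.0, 1.0), "nucl"), ((0.9, 1.1), "nucl"), ((0.4, 0.4), "nucl"),
]
print(jackknife_accuracy(toy))
```

The loop trains one model per sample, which is the source of the computational cost noted in the excerpt.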
Chapter
The MARCH-INSIDE approach is a computational method that can be used to seek Quantitative Structure-Property Relationships (QSAR) models in genes and their product RNA and/or proteins without relying upon sequence alignment. The present chapter reviews previous applications of MARCH-INSIDE to predict the function of new, experimentally discovered sequences and discusses the legal issues related to using QSAR and, in general, Bioinformatics models in real research and development problems in Plant Genomics. First, we give some details on the theoretical basis of MARCH-INSIDE. Next, we review previously reported applications of MARCH-INSIDE in Plant Genomics, including the isolation and prediction of new genes and/or gene products (proteins, RNAs) such as: 1) Ribonucleases (RNAses), 2) 1-aminocyclopropane-1-carboxylate oxidases and synthases (ACOs), 3) Polygalacturonases (PGs) and 4) 18S Ribosomal RNAs (18S-rRNAs). Last, we discuss the legal issues that should be considered when using MARCH-INSIDE or other QSAR approaches, such as copyright, patents, and taxes. From this chapter it is possible to conclude that MARCH-INSIDE models may be applied in Plant Genomics and Biotechnology to find new interesting enzymes without relying upon alignment techniques. The classification power of these models is comparable to sequence-alignment based methods like BLAST. There are legal aspects derived from the use of QSAR models in this field that should be carefully taken into consideration when using them.
... On the new dataset containing 317 apoptosis protein sequences classified into six subcellular locations, and obtained higher prediction accuracy by jackknife test. Based on the CL317 dataset, Ding et al. [35] obtained the overall prediction accuracy of 90.9% by using the Fuzzy K-nearest neighbor (FKNN) algorithm. Qiu et al. [36] proposed a novel approach by combining discrete wavelet transform with support vector machine (named as DWT_SVM). ...
Article
Subcellular localization information for apoptosis proteins is very important for understanding the mechanism of programmed cell death and for the development of drugs. Predicting the subcellular localization of an apoptosis protein remains a challenging task, and such predictions can help to understand the proteins' function and their role in metabolic processes. In this paper, we propose a novel method for protein subcellular localization prediction. Firstly, the features of the protein sequence are extracted by combining Chou's pseudo amino acid composition (PseAAC) and pseudo-position specific scoring matrix (PsePSSM); then the extracted feature information is denoised by two-dimensional (2-D) wavelet denoising. Finally, the optimal feature vectors are input to the SVM classifier to predict the subcellular location of apoptosis proteins. Quite promising predictions are obtained using the jackknife test on three widely used datasets and compared with other state-of-the-art methods. The results indicate that the method proposed in this paper can remarkably improve the prediction accuracy of apoptosis protein subcellular localization, and it will be a supplementary tool for future proteomics research.
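As an illustration of the PseAAC part of such a feature vector only (PsePSSM and wavelet denoising are not covered), here is a minimal Python sketch of the classic form: 20 amino acid frequencies plus `lam` sequence-order correlation factors, jointly normalized. The use of a single property (the Kyte-Doolittle hydropathy scale), the default `weight=0.05`, and the function names are assumptions made for this sketch, not the settings of the study above.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here as a single stand-in property.
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4, "H": -3.2,
    "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5,
    "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def standardize(scale):
    """Rescale property values to zero mean and unit standard deviation."""
    values = list(scale.values())
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return {aa: (v - mean) / std for aa, v in scale.items()}


def pse_aac(sequence, lam=2, weight=0.05, scale=HYDROPATHY):
    """Return a (20 + lam)-dimensional PseAAC vector.

    Assumes `sequence` contains only the 20 standard residues.
    """
    if len(sequence) <= lam:
        raise ValueError("sequence must be longer than lam")
    prop = standardize(scale)
    freqs = [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]
    thetas = []
    for k in range(1, lam + 1):
        # k-th tier correlation: mean squared property difference at distance k
        diffs = [(prop[sequence[i]] - prop[sequence[i + k]]) ** 2
                 for i in range(len(sequence) - k)]
        thetas.append(sum(diffs) / len(diffs))
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


vector = pse_aac("MKVLAAGIVGLLLA", lam=3)
```

The first 20 components carry composition and the last `lam` components carry sequence-order information; `weight` controls the balance between the two.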
... During the last two decades or so, many computational methods were developed to address this problem [see e.g. (Cai and Chou, 2000; Cedano et al., 1997; Chou and Cai, 2002; Elrod, 1998, 1999a, b; Ding and Zhang, 2008; Emanuelsson et al., 2000; Gardy, 2003; Nanni and Lumini, 2008; Reinhardt and Hubbard, 1998) as well as two review papers (Chou and Shen, 2007c; Nakai, 2000) and a long list of references cited therein]. ...
Article
Motivation: For in-depth understanding the functions of proteins in a cell, the knowledge of their subcellular localization is indispensable. The current study is focused on human protein subcellular location prediction based on the sequence information alone. Although considerable efforts have been made in this regard, the problem is far from being solved yet. Most existing methods can be used to deal with single-location proteins only. Actually, proteins with multi-locations may have some special biological functions that are particularly important for both basic research and drug design. Results: Using the multi-label theory, we present a new predictor called "pLoc-mHum" by extracting the crucial GO (Gene Ontology) information into the general PseAAC (Pseudo Amino Acid Composition). Rigorous cross-validations on a same stringent benchmark dataset have indicated that the proposed pLoc-mHum predictor is remarkably superior to iLoc-Hum, the state-of-the-art method in predicting the human protein subcellular localization. Availability: To maximize the convenience of most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc-mHum/, by which users can easily get their desired results without the need to go through the complicated mathematics involved. Supplementary information: Supplementary data are available at Bioinformatics online.
... Obtaining the relevant information from this sequence is necessary. During the past years, several machine learning methods including covariant discriminant function [11], support vector machine (SVM) [12][13][14], the increment of diversity combined with support vector machine (ID SVM) [15][16] and k-nearest neighbor (KNN), have been extended to predict extracellular matrix proteins. For example, Alexandra et al. [7] produced a bioinformatic strategy to predict the silicon "matrisome" marked as the ensemble of ECM proteins and associated factors. Seminara et al. [8] recommended that the secretion of EPS drives surface motility by creating osmotic pressure gradients in the extracellular space. ...
Article
Extracellular matrix proteins play a major role in the tissues of multicellular organisms. The extracellular matrix provides structural support for cells inside a tumor. Meanwhile, it also works homeostatically to mediate the interaction between cells. However, current bioinformatics tools for predicting extracellular matrix proteins often seem to fail. This work introduces a method for predicting ECM proteins from the protein sequence as well as the molecular characteristics. We report a novel hybrid animal migration optimization and random forest (RF) method to predict extracellular matrix protein sequences, adopting four different feature design methods. Binary animal migration optimization (AMORF) is used to select a near-optimal subset of informative features that are most relevant for the classification. AMORF was evaluated on a dataset including 145 ECM and 3887 non-ECM proteins. Our algorithm achieves an accuracy of 86.4700%, a sensitivity of 84.9655%, a specificity of 86.5261%, an MCC of 0.3627 and an AUC of 0.877804. The results confirm that the proposed method is promising. From the results, we can conclude that it can choose small subsets of features and still increase the classification efficiency.
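The reported measures (accuracy, sensitivity, specificity, MCC) all derive from the four cells of a binary confusion matrix. The Python sketch below uses the standard definitions; the counts in the usage example are hypothetical, chosen only to resemble a 145 versus 3887 class imbalance, and are not the study's actual confusion matrix.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute accuracy, sensitivity, specificity and MCC from confusion counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts for an imbalanced dataset (145 positives, 3887 negatives).
example = binary_metrics(tp=120, fn=25, tn=3400, fp=487)
```

On strongly imbalanced data, the many false positives drawn from the large negative class keep the MCC well below the accuracy, which is consistent with an MCC near 0.36 appearing alongside an accuracy above 86%.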
Chapter
Due to rapid urbanization and industrial development, a large number of pollutants and toxic products are released into the environment, which has raised concern in developing and developed nations around the world. In recent times, nanotechnology-based approaches have proved highly efficient for the detection, degradation, and removal of hazardous pesticides from contaminated sites. Nanomaterials exhibit unique physicochemical properties, and hence they have received much attention among researchers in the arena of environmental bioremediation. This chapter extensively covers recent progress and understanding in the field of nanobioremediation and also its future perspective. Keywords: Pesticides; Nanobioremediation (NBR); Bioremediation; Nanomaterial
Article
DNA replication is one of the specific processes to be considered in all living organisms, specifically eukaryotes. The prevalence of DNA replication is significant for an evolutionary transition at the beginning of life. DNA replication proteins are those proteins which support the process of replication and are also reported to be important in drug design and discovery. This indicates that DNA replication proteins have a very important role in the human body; however, to study their mechanism, their identification is necessary. This is therefore a very important task, but experimental identification is time-consuming, costly and laborious. To cope with this issue, a computational methodology is required for the prediction of these proteins; however, no prior method exists. This study comprises the construction of a novel prediction model to serve this purpose. The prediction model is based on an artificial neural network, integrating position relative features and sequence statistical moments into PseAAC for training. The highest overall accuracies achieved through tenfold cross-validation and jackknife testing were 96.22% and 98.56%, respectively. Our experimental results demonstrate that the proposed predictor surpasses the existing models and can serve as a time- and cost-effective stratagem for designing novel drugs to strike the contemporary bacterial infection.
Article
Background The liver is a vital organ in the human body involved in metabolic processes. The liver can be damaged by factors such as protein deficiency, viral infection, and the consumption of alcohol, chemical contaminants, and adulterated foods. High blood cholesterol, high blood pressure, diabetes, lack of exercise, poor diet, obesity and cigarette smoking are major risk factors for stroke, heart attack and coronary heart disease (CHD). In medical science, a number of synthetic drugs have been discovered and used to treat people suffering from liver injury and CHD, but they were not always effective, were sometimes difficult to manage with medical therapies, and were also found to be accompanied by negative side effects. Objective The aim of the study was to critically review recent research, including epidemiological studies and randomized controlled trials, to identify effective cereal proteins as an alternative preventive food to reduce CHD and protect the liver from viral hepatic diseases, focusing on daily food intake, body weight, liver weight, serum enzyme activities and cholesterol levels. Methods Some data were taken from our own experiment, and a literature search of published research articles, review papers, and epidemiological and randomized controlled trials on the effects of cereal protein in animals and human interventions was performed using Google, Google Scholar, Redcube, Endnote, Scopus, SpringerDirect.com, PubMed and Web of Science. The data were then organized, summarized and analyzed. Results In medical science, the serum enzyme activities of aspartate aminotransferase (AST), alanine aminotransferase (ALT) and lactate dehydrogenase (LDH), and the lipid peroxidation stress marker malondialdehyde (MDA), are commonly used as biochemical markers of liver damage.
Blood cholesterols (total cholesterol-TC, triglyceride-TG, low density lipoprotein cholesterol-LDLC and high-density lipoprotein cholesterol-HDLC) are used as markers of heart disease. The review shows that daily food intake and body weight do not differ significantly among normal diet, casein (CAS) and cereal protein. Millet and wheat protein increase liver weight, whereas rice protein lowers it. The intake of cereal protein significantly reduces the activities of serum AST, ALT, LDH, MDA, TC, TG and LDLC, whereas it increases HDLC. Conclusion Experimental, review and randomized controlled trial (RCT) data confirm that cereal protein appears to be beneficial in reducing hepatic injury and CHD by maintaining body weight, liver weight, blood pressure, serum enzyme activities AST, ALT and LDH, lipid peroxidation stress MDA and cholesterol concentrations both in plasma and liver.
Article
Facing the explosive growth of biological sequences unearthed in the post-genomic age, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector, but still keep it with considerable sequence-order information or its special pattern. To deal with such a challenging problem, the ideas of “pseudo amino acid components” and “pseudo K-tuple nucleotide composition” have been proposed. The ideas and their approaches have further stimulated the birth for “distorted key theory”, “wenxing diagram”, and substantially strengthening the power in treating the multi-label systems, as well as the establishment of the famous “5-steps rule”. All these logic developments are quite natural that are very useful not only for theoretical scientists but also for experimental scientists in conducting genetics/genomics analysis and drug development. Presented in this review paper are also their future perspectives; i.e., their impacts will become even more significant and propounding.
Article
During the last three decades or so, many efforts have been made to study the protein cleavage sites by some disease-causing enzyme, such as HIV (human immunodeficiency virus) protease and SARS (severe acute respiratory syndrome) coronavirus main proteinase. It has become increasingly clear via this minireview that the motivation driving the aforementioned studies is quite wise, and that the results acquired through these studies are very rewarding, particularly for developing peptide drugs.
Article
In this minireview paper it has been elucidated that the proposal of pseudo amino acid components represents a very important milestone for the disciplines of proteome and genome. This has been concluded by observing and analyzing the developments in the following six different sub-disciplines: (1) proteome analysis; (2) genome analysis; (3) protein structural classification; (4) protein subcellular location prediction; (5) post-translational modification (PTM) site prediction; (6) stimulating the birth of the renowned and very powerful 5-steps rule.
Article
Identification of the sites of post-translational modifications (PTMs) in protein, RNA, and DNA sequences is currently a very hot topic. This is because the information thus obtained is very useful for in-depth understanding the biological processes at the cellular level and for developing effective drugs against major diseases including cancers as well. Although this can be done by means of various experimental techniques, it is both time-consuming and costly to determine the PTM sites purely based on experiments. With the avalanche of biological sequences generated in the post-genomic age, it is highly desired to develop bioinformatics tools for rapidly and effectively identifying the PTM sites. In the last few years, many efforts have been made in this regard, and considerable progresses have been achieved. This review is focused on those prediction methods that have the following two features. (1) They have been developed by strictly observing the 5-steps rule so that they each have a user-friendly web-server for the majority of experimental scientists to easily get their desired data without the need to go through the detailed mathematics involved. (2) Their cornerstones have been based on Pseudo Amino Acid Composition (PseAAC) or Pseudo K-tuple Nucleotide Composition (PseKNC), and hence the prediction quality is generally higher than most of the other PTM prediction methods.
Article
Background/objective: Information of protein subcellular localization is crucially important for both basic research and drug development. With the explosive growth of protein sequences discovered in the post-genomic age, it is highly demanded to develop powerful bioinformatics tools for timely and effectively identifying their subcellular localization purely based on the sequence information alone. Recently, a predictor called "pLoc-mEuk" was developed for identifying the subcellular localization of eukaryotic proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems where many proteins, called "multiplex proteins", may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mEuk was trained by an extremely skewed dataset where some subset was about 200 times the size of the other subsets. Accordingly, it cannot avoid the biased consequence caused by such an uneven training dataset. Methods: To alleviate such bias, we have developed a new predictor called pLoc_bal-mEuk by quasi-balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mEuk, the existing state-of-the-art predictor in identifying the subcellular localization of eukaryotic proteins. It has not escaped our notice that the quasi-balancing treatment can also be used to deal with many other biological systems. Results: To maximize the convenience for most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mEuk/.
Conclusion: It is anticipated that the pLoc_bal-Euk predictor holds very high potential to become a useful high throughput tool in identifying the subcellular localization of eukaryotic proteins, particularly for finding multi-target drugs, which is currently a very hot trend in drug development.
Article
Objective: Knowledge of protein subcellular localization is vitally important for both basic research and drug development. Facing the avalanche of protein sequences emerging in the post-genomic age, it is urgent to develop computational tools for timely and effectively identifying their subcellular localization based on the sequence information alone. Recently, a predictor called "pLoc-mVirus" was developed for identifying the subcellular localization of virus proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, known as "multiplex proteins", may simultaneously occur in, or move between, two or more subcellular location sites. Despite the fact that it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mVirus was trained by an extremely skewed dataset in which some subset was over 10 times the size of the other subsets. Accordingly, it cannot avoid the biased consequence caused by such an uneven training dataset. Methods: Using the general PseAAC (Pseudo Amino Acid Composition) approach and the IHTS (Inserting Hypothetical Training Samples) treatment to balance out the training dataset, we have developed a new predictor called "pLoc_bal-mVirus" for predicting the subcellular localization of multi-label virus proteins. Results: Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mVirus, the existing state-of-the-art predictor for the same purpose. Conclusion: Its user-friendly web-server is available at http://www.jci-bioinfo.cn/pLoc_bal-mVirus/, by which the majority of experimental scientists can easily get their desired results without the need to go through the detailed complicated mathematics. 
Accordingly, pLoc_bal-mVirus will become a very useful tool for designing multi-target drugs and in-depth understanding the biological process in a cell.
Article
One of the hottest topics in molecular cell biology is to determine the subcellular localization of proteins from various different organisms. This is because it is crucially important for both basic research and drug development. Recently, a predictor called "pLoc-mGneg" was developed for identifying the subcellular localization of Gram-negative bacterial proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, called "multiplex proteins", may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mGneg was trained by an extremely skewed dataset in which some subset (subcellular location) was about 5 to 70 times the size of the other subsets. Accordingly, it cannot avoid the biased consequence caused by such an uneven training dataset. To alleviate such a consequence, we have developed a new and bias-reducing predictor called pLoc_bal-mGneg by quasi-balancing the training dataset. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mGneg, the existing state-of-the-art predictor in identifying the subcellular localization of Gram-negative bacterial proteins. To maximize the convenience for most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mGneg/, by which users can easily get their desired results without the need to go through the detailed mathematics.
Article
A cell contains numerous protein molecules. One of the fundamental goals in molecular cell biology is to determine their subcellular locations since this information is extremely important to both basic research and drug development. In this paper, we report a novel and very powerful predictor called "pLoc_bal-mHum" for predicting the subcellular localization of human proteins based on their sequence information alone. Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the new predictor is remarkably superior to the existing state-of-the-art predictor in identifying the subcellular localization of human proteins. To maximize the convenience for the majority of experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mHum/, by which users can easily get their desired results without the need to go through the detailed mathematics.
Article
Knowledge of protein subcellular localization is vitally important for both basic research and drug development. With the avalanche of protein sequences emerging in the post-genomic age, it is highly desired to develop computational tools for timely and effectively identifying their subcellular localization purely based on the sequence information alone. Recently, a predictor called "pLoc-mGpos" was developed for identifying the subcellular localization of Gram-positive bacterial proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, called "multiplex proteins", may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mGpos was trained by an extremely skewed dataset in which some subset (subcellular location) was over 11 times the size of the other subsets. Accordingly, it cannot avoid the bias consequence caused by such an uneven training dataset. To alleviate such bias consequence, we have developed a new and bias-reducing predictor called pLoc_bal-mGpos by quasi-balancing the training dataset. Rigorous target jackknife tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mGpos, the existing state-of-the-art predictor in identifying the subcellular localization of Gram-positive bacterial proteins. To maximize the convenience for most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mGpos/, by which users can easily get their desired results without the need to go through the detailed mathematics.
Article
The ability of evolutionary algorithms (EAs) to manage a set of solutions, even attending multiple objectives, as well as their ability to optimize any kinds of values, allows them to fit very well some parts of the data‐mining (DM) problems, whose native learning techniques usually associated with the inherent DM problem are not able to solve. Therefore, EAs are widely applied to complement or even replace the classical DM learning approaches. This application of EAs to the DM process is usually named evolutionary data mining (EDM). This contribution aims at showing a glimpse of the EDM field current state by focusing on the most cited papers published in the last 10 years. A descriptive analysis of the papers together with a bibliographic study is performed in order to differentiate past and current trends and to easily focus on significant further developments. Results show that, in the case of the most cited studied papers, the use of EAs on DM tasks is mainly focused on enhancing the classical learning techniques, thus completely replacing them only when it is directly motivated by the nature of problem. The bibliographic analysis is also showing that even though EAs were the main techniques used for EDM, the emergent evolutionary computation algorithms (swarm intelligence, etc.) are becoming nowadays the most cited and used ones. Based on all these facts, some potential further directions are also discussed. WIREs Data Mining Knowl Discov 2018, 8:e1239. doi: 10.1002/widm.1239 This article is categorized under: Fundamental Concepts of Data and Knowledge > Knowledge Representation Technologies > Computational Intelligence Technologies > Classification Technologies > Prediction
Article
Information of the proteins' subcellular localization is crucially important for revealing their biological functions in a cell, the basic unit of life. With the avalanche of protein sequences generated in the postgenomic age, it is highly desired to develop computational tools for timely identifying their subcellular locations based on the sequence information alone. The current study is focused on the Gram-negative bacterial proteins. Although considerable efforts have been made in protein subcellular prediction, the problem is far from being solved yet. This is because mounting evidences have indicated that many Gram-negative bacterial proteins exist in two or more location sites. Unfortunately, most existing methods can be used to deal with single-location proteins only. Actually, proteins with multi-locations may have some special biological functions important for both basic research and drug design. In this study, by using the multi-label theory, we developed a new predictor called "pLoc-mGneg" for predicting the subcellular localization of Gram-negative bacterial proteins with both single and multiple locations. Rigorous cross-validation on a high quality benchmark dataset indicated that the proposed predictor is remarkably superior to "iLoc-Gneg", the state-of-the-art predictor for the same purpose. For the convenience of most experimental scientists, a user-friendly web-server for the novel predictor has been established at http://www.jci-bioinfo.cn/pLoc-mGneg/, by which users can easily get their desired results without the need to go through the complicated mathematics involved.
Article
Objectives: In this paper, a high-quality sequence encoding scheme is proposed for predicting subcellular location of apoptosis proteins. Methods: In the proposed methodology, novel evolutionary-conservative information is introduced to represent protein sequences. Meanwhile, based on the golden section proportion in mathematics, the position-specific scoring matrix (PSSM) is divided into several blocks. Then, these features are classified by a support vector machine (SVM), and the predictive capability of the proposed method is evaluated by the jackknife test. Results: The results show that the golden section method is better than the no-segmentation method. The overall accuracy for ZD98 and CL317 is 98.98% and 91.11%, respectively, which indicates that our method can play a complementary role to the existing methods in the relevant areas. Conclusions: The proposed feature representation is powerful and improves the prediction accuracy greatly, which indicates that our method provides state-of-the-art performance for predicting subcellular location of apoptosis proteins.
Article
Pse-in-One 2.0 is a package of web-servers evolved from Pse-in-One (Liu, B., Liu, F., Wang, X., Chen, J. Fang, L. & Chou, K.C. Nucleic Acids Research, 2015, 43:W65-W71). In order to make it more flexible and comprehensive as suggested by many users, the updated package has incorporated 23 new pseudo component modes as well as a series of new feature analysis approaches. It is available at http://bioinformatics.hitsz.edu.cn/Pse-in-One2.0/. Moreover, to maximize the convenience of users, provided is also the stand-alone version called “Pse-in-One-Analysis”, by which users can significantly speed up the analysis of massive sequences.
Article
Fifteen physicochemical descriptors of side chains of the 20 natural and of 26 non-coded amino acids are compiled and simple methods for their evaluation described. The relevance of these parameters to account for hydrophobic, steric, and electric properties of the side chains is assessed and their intercorrelation analyzed. It is shown that three principal components, one steric, one bulk, and one electric (electronic), account for 66% of the total variance in the available set. These parameters may prove to be useful for correlation studies in series of bioactive peptide analogues.
Article
Motivation: Most of the existing methods in predicting protein subcellular location were used to deal with the cases limited within the scope from two to five localizations, and only a few of them can be effectively extended to cover the cases of 12-14 localizations. This is because the more the locations involved are, the poorer the success rate would be. Besides, some proteins may occur in several different subcellular locations, i.e. bear the feature of 'multiplex locations'. So far there is no method that can be used to effectively treat the difficult multiplex location problem. The present study was initiated in an attempt to address (1) how to efficiently identify the localization of a query protein among many possible subcellular locations, and (2) how to deal with the case of multiplex locations. Results: By hybridizing gene ontology, functional domain and pseudo amino acid composition approaches, a new method has been developed that can be used to predict subcellular localization of proteins with multiplex location feature. A global analysis of the proteins in budding yeast classified into 22 locations was performed by jack-knife cross-validation with the new method. The overall success identification rate thus obtained is 70%. In contrast to this, the corresponding rates obtained by some other existing methods were only 13-14%, indicating that the new method is very powerful and promising. Furthermore, predictions were made for the four proteins whose localizations could not be determined by experiments, as well as for the 236 proteins whose localizations in budding yeast were ambiguous according to experimental observations. However, according to our predicted results, many of these 'ambiguous proteins' were found to have the same score and ranking for several different subcellular locations, implying that they may simultaneously exist, or move around, in these locations. 
This finding is intriguing because it reflects the dynamic feature of these proteins in a cell that may be associated with some special biological functions.
Article
A method is presented for locating protein antigenic determinants by analyzing amino acid sequences in order to find the point of greatest local hydrophilicity. This is accomplished by assigning each amino acid a numerical value (hydrophilicity value) and then repetitively averaging these values along the peptide chain. The point of highest local average hydrophilicity is invariably located in, or immediately adjacent to, an antigenic determinant. It was found that the prediction success rate depended on averaging group length, with hexapeptide averages yielding optimal results. The method was developed using 12 proteins for which extensive immunochemical analysis has been carried out and subsequently was used to predict antigenic determinants for the following proteins: hepatitis B surface antigen, influenza hemagglutinins, fowl plague virus hemagglutinin, human histocompatibility antigen HLA-B7, human interferons, Escherichia coli and cholera enterotoxins, ragweed allergens Ra3 and Ra5, and streptococcal M protein. The hepatitis B surface antigen sequence was synthesized by chemical means and was shown to have antigenic activity by radioimmunoassay.
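The sliding-window averaging described above can be sketched as follows. This is a minimal illustration, not the authors' program: the function names are hypothetical, and the hydrophilicity values are those commonly tabulated for this scale, which should be checked against the original paper before any real use. The default window of six residues follows the hexapeptide averaging reported as optimal.

```python
# Hydrophilicity values as commonly tabulated for this scale.
# Assumption: verify against the original publication before relying on them.
HYDROPHILICITY = {
    "R": 3.0, "D": 3.0, "E": 3.0, "K": 3.0, "S": 0.3, "N": 0.2, "Q": 0.2,
    "G": 0.0, "P": 0.0, "T": -0.4, "H": -0.5, "A": -0.5, "C": -1.0,
    "M": -1.3, "V": -1.5, "I": -1.8, "L": -1.8, "Y": -2.3, "F": -2.5,
    "W": -3.4,
}


def window_averages(sequence, window=6, scale=HYDROPHILICITY):
    """Average the per-residue values over every window of the given length."""
    values = [scale[residue] for residue in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def most_hydrophilic_window(sequence, window=6, scale=HYDROPHILICITY):
    """Return (start index, peptide, average) of the highest-scoring window."""
    averages = window_averages(sequence, window, scale)
    if not averages:
        raise ValueError("sequence is shorter than the averaging window")
    # max() returns the first index on ties.
    start = max(range(len(averages)), key=averages.__getitem__)
    return start, sequence[start:start + window], averages[start]
```

Calling `most_hydrophilic_window(seq)` on a protein sequence returns the candidate region that, according to the abstract, lies in or next to an antigenic determinant.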
Article
A protein is usually classified into one of the following five structural classes: alpha, beta, alpha + beta, alpha/beta, and zeta (irregular). The structural class of a protein is correlated with its amino acid composition. However, given the amino acid composition of a protein, how may one predict its structural class? Various efforts have been made in addressing this problem. This review addresses the progress in this field, with the focus on the state of the art, which is featured by a novel prediction algorithm and a recently developed database. The novel algorithm is characterized by a covariance matrix that takes into account the coupling effect among different amino acid components of a protein. The new database was established based on the requirement that the classes should have (1) as many nonhomologous structures as possible, (2) good quality structure, and (3) typical or distinguishable features for each of the structural classes concerned. The very high success rate for both the training-set proteins and the testing-set proteins, which has been further validated by a simulated analysis and a jackknife analysis, indicates that it is possible to predict the structural class of a protein according to its amino acid composition if an ideal and complete database can be established. It also suggests that the overall fold of a protein is basically determined by its amino acid composition.
Article
Biological processes in any living organism are based on selective interactions between particular biomolecules. In most cases, these interactions involve and are driven by proteins which are the main conductors of any living process within the organism. The physical nature of these interactions is still not well known. This paper represents a whole new view to biomolecular interactions, in particular protein-protein and protein-DNA interactions, based on the assumption that these interactions are electromagnetic in their nature. This new approach is incorporated in the Resonant Recognition Model (RRM), which was developed over the last 10 years. It has been shown initially that certain periodicities within the distribution of energies of delocalized electrons along a protein molecule are critical for protein biological function, i.e., interaction with its target. If protein conductivity was introduced, then a charge moving through protein backbone can produce electromagnetic irradiation or absorption with spectral characteristics corresponding to energy distribution along the protein. The RRM enables these spectral characteristics, which were found to be in the range of infrared and visible light, to be calculated. These theoretically calculated spectra were proved using experimentally obtained frequency characteristics of some light-induced biological processes. Furthermore, completely new peptides with desired spectral characteristics, and consequently corresponding biological activities, were designed.
Article
In multicellular organisms, mutations in somatic cells affecting critical genes that regulate cell proliferation and survival cause fatal cancers. Repair of the damage is one obvious option, although the relative inconsequence of individual cells in metazoans means that it is often a "safer" strategy to ablate the offending cell. Not surprisingly, corruption of the machinery that senses or implements DNA damage greatly predisposes to cancer. Nonetheless, even when oncogenic mutations do occur, there exist potent mechanisms that limit the expansion of affected cells by suppressing their proliferation or triggering their suicide. Growing understanding of these innate mechanisms is suggesting novel therapeutic strategies for cancer.
Article
Apoptosis is one of the most exciting and intensely investigated areas of biology and medicine today. Cysteine proteases called caspases serve as the executioners of apoptosis, a form of cell suicide. Hypoxic/ischemic cell death proceeds in part, by apoptosis, particularly within the periinfarct zone or ischemic penumbra. During ischemia, activated caspases dismantle the cell by cleaving multiple substrates including cytoskeletal proteins and enzymes essential for cell repair. Strategies that inhibit caspase activity block cell death in experimental models of mild ischemia, and preserve neurological function. The therapeutic window for caspase inhibition is substantially longer than for glutamate receptor antagonists, and treatment combinations with both classes of drugs decrease ischemic injury and expand the treatment window synergistically. Hence, the caspases are now recognized as novel therapeutic targets for central nervous system diseases in which cell death is prominent. This article will review the evidence and the potential importance of caspase inhibition to cerebral ischemia and briefly summarize an emerging body of data implicating caspases in cell death accompanying neurodegenerative disorders.
Article
Entropy, as it relates to dynamical systems, is the rate of information production. Methods for estimation of the entropy of a system represented by a time series are not, however, well suited to analysis of the short and noisy data sets encountered in cardiovascular and other biological studies. Pincus introduced approximate entropy (ApEn), a set of measures of system complexity closely related to entropy, which is easily applied to clinical cardiovascular and other time series. ApEn statistics, however, lead to inconsistent results. We have developed a new and related complexity measure, sample entropy (SampEn), and have compared ApEn and SampEn by using them to analyze sets of random numbers with known probabilistic character. We have also evaluated cross-ApEn and cross-SampEn, which use cardiovascular data sets to measure the similarity of two distinct time series. SampEn agreed with theory much more closely than ApEn over a broad range of conditions. The improved accuracy of SampEn statistics should make them useful in the study of experimental clinical cardiovascular and other biological time series.
Article
Techniques to determine changing system complexity from data are evaluated. Convergence of a frequently used correlation dimension algorithm to a finite value does not necessarily imply an underlying deterministic model or chaos. Analysis of a recently developed family of formulas and statistics, approximate entropy (ApEn), suggests that ApEn can classify complex systems, given at least 1000 data values in diverse settings that include both deterministic chaotic and stochastic processes. The capability to discern changing complexity from such a relatively small amount of data holds promise for applications of ApEn in a variety of contexts.
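For readers unfamiliar with ApEn, the statistic can be sketched in a few lines. This is a minimal, unoptimized illustration under stated assumptions: the function names are hypothetical, the embedding dimension `m` and tolerance `r` defaults are illustrative rather than taken from the abstract, and self-matches are counted, as in the standard definition. Applying it to a protein would first require mapping residues to numbers (for example, a physicochemical scale), which is also an assumption here.

```python
import math


def _phi(series, m, r):
    """Mean log-fraction of length-m templates within tolerance r of each template."""
    n = len(series) - m + 1
    templates = [series[i:i + m] for i in range(n)]
    total = 0.0
    for a in templates:
        # Chebyshev distance between templates; the self-match keeps the count >= 1.
        matches = sum(
            1 for b in templates
            if max(abs(x - y) for x, y in zip(a, b)) <= r
        )
        total += math.log(matches / n)
    return total / n


def approximate_entropy(series, m=2, r=0.2):
    """ApEn(m, r) = phi_m - phi_(m+1); larger values indicate a less regular series."""
    if len(series) < m + 2:
        raise ValueError("series is too short for the chosen embedding dimension")
    return _phi(series, m, r) - _phi(series, m + 1, r)
```

A constant series gives exactly zero, a strictly alternating series gives a value close to zero, and irregular series give larger values. In practice `r` is often chosen relative to the standard deviation of the series.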
Article
Proteins are generally classified into the following 12 subcellular locations: 1) chloroplast, 2) cytoplasm, 3) cytoskeleton, 4) endoplasmic reticulum, 5) extracellular, 6) Golgi apparatus, 7) lysosome, 8) mitochondria, 9) nucleus, 10) peroxisome, 11) plasma membrane, and 12) vacuole. Because the function of a protein is closely correlated with its subcellular location, with the rapid increase in new protein sequences entering into databanks, it is vitally important for both basic research and pharmaceutical industry to establish a high throughput tool for predicting protein subcellular location. In this paper, a new concept, the so-called "functional domain composition" is introduced. Based on the novel concept, the representation for a protein can be defined as a vector in a high-dimensional space, where each of the clustered functional domains derived from the protein universe serves as a vector base. With such a novel representation for a protein, the support vector machine (SVM) algorithm is introduced for predicting protein subcellular location. High success rates are obtained by the self-consistency test, jackknife test, and independent dataset test, respectively. The current approach not only can play an important complementary role to the powerful covariant discriminant algorithm based on the pseudo amino acid composition representation (Chou, K. C. (2001) Proteins Struct. Funct. Genet. 43, 246-255; Correction (2001) Proteins Struct. Funct. Genet. 44, 60), but also may greatly stimulate the development of this area.
Article
Apoptosis proteins have a central role in the development and homeostasis of an organism. These proteins are very important for understanding the mechanism of programmed cell death. Many efforts in pharmaceutical research have been aimed at understanding their structure and function. Unfortunately, thus far, very few apoptosis protein structures have been determined. In contrast, many apoptosis protein sequences are known, and many more are expected in the near future. Because of this extreme imbalance, it would be worthwhile to develop a fast sequence-based method to identify their subcellular location so as to gain some insight into their biological function. In view of this, a study was initiated in an attempt to identify the subcellular location of apoptosis proteins according to their sequences by means of the covariant discriminant function, which was established based on the Mahalanobis distance and Chou's invariance theorem (Chou, Proteins 1995;21:319-344). The results were quite promising, indicating that the subcellular location of apoptosis proteins is predictable to a considerably accurate extent if a good training data set can be established. It is expected that, with continuous improvement of the training data set by incorporating more and more new data, the current method might eventually become a useful tool in this area because the function of an apoptosis protein is closely related to its subcellular location.
Article
During the last two decades, the number of sequence-known proteins has increased rapidly. In contrast, the corresponding increment for structure-known proteins is much slower. The unbalanced situation has critically limited our ability to understand the molecular mechanism of proteins and conduct structure-based drug design by timely using the updated information of newly found sequences. Therefore, it is highly desired to develop an automated method for fast deriving the 3D (3-dimensional) structure of a protein from its sequence. Under such a circumstance, the structural bioinformatics was emerging naturally as the time required. In this review, three main strategies developed in structural bioinformatics, i.e., pure energetic approach, heuristic approach, and homology modeling approach, as well as their underlying principles, are briefly introduced. Meanwhile, a series of demonstrations are presented to show how the structural bioinformatics has been applied to timely derive the 3D structures of some functionally important proteins, helping to understand their action mechanisms and stimulating the course of drug discovery. Also, the limitation of these approaches and the future challenges of structural bioinformatics are briefly addressed.
Article
Motivation: With protein sequences entering into databanks at an explosive pace, the early determination of the family or subfamily class for a newly found enzyme molecule becomes important because this is directly related to the detailed information about which specific target it acts on, as well as to its catalytic process and biological function. Unfortunately, it is both time-consuming and costly to do so by experiments alone. In a previous study, the covariant-discriminant algorithm was introduced to identify the 16 subfamily classes of oxidoreductases. Although the results were quite encouraging, the entire prediction process was based on the amino acid composition alone without including any sequence-order information. Therefore, it is worthy of further investigation. Results: To incorporate the sequence-order effects into the predictor, the 'amphiphilic pseudo amino acid composition' is introduced to represent the statistical sample of a protein. The novel representation contains 20 + 2lambda discrete numbers: the first 20 numbers are the components of the conventional amino acid composition; the next 2lambda numbers are a set of correlation factors that reflect different hydrophobicity and hydrophilicity distribution patterns along a protein chain. Based on such a concept and formulation scheme, a new predictor is developed. It is shown by the self-consistency test, jackknife test and independent dataset tests that the success rates obtained by the new predictor are all significantly higher than those by the previous predictors. The significant enhancement in success rates also implies that the distribution of hydrophobicity and hydrophilicity of the amino acid residues along a protein chain plays a very important role to its structure and function.
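The 20 + 2λ representation described above can be sketched as follows. This is a hedged illustration, not the published implementation: the function names, the weight default of 0.5 and the normalization details are assumptions, and the two property scales are passed in by the caller because the abstract does not list their values.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def standardize(scale):
    """Rescale a 20-residue property table to zero mean and unit variance."""
    mean = sum(scale[a] for a in AMINO_ACIDS) / 20
    std = (sum((scale[a] - mean) ** 2 for a in AMINO_ACIDS) / 20) ** 0.5
    return {a: (scale[a] - mean) / std for a in AMINO_ACIDS}


def amphiphilic_pseaac(sequence, hydrophobicity, hydrophilicity, lam=3, weight=0.5):
    """Return 20 composition terms followed by 2 * lam correlation terms."""
    n = len(sequence)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")
    h1 = standardize(hydrophobicity)
    h2 = standardize(hydrophilicity)
    freqs = [sequence.count(a) / n for a in AMINO_ACIDS]
    taus = []
    for k in range(1, lam + 1):
        # One hydrophobicity and one hydrophilicity correlation factor per lag k.
        for h in (h1, h2):
            products = (h[sequence[i]] * h[sequence[i + k]] for i in range(n - k))
            taus.append(sum(products) / (n - k))
    # Correlation factors can be negative, so the denominator is not guaranteed
    # to be positive for every sequence and weight; callers should check it.
    denom = sum(freqs) + weight * sum(taus)
    return [f / denom for f in freqs] + [weight * t / denom for t in taus]
```

The first 20 entries carry the conventional amino acid composition and the remaining 2λ entries carry the sequence-order information; by construction the whole vector sums to one.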
Article
Recent advances in large-scale genome sequencing have led to the rapid accumulation of amino acid sequences of proteins whose functions are unknown. Because the functions of these proteins are closely correlated with their subcellular localizations, it is vitally important to develop an automated method as a high-throughput tool to timely identify their subcellular location. Based on the concept of the pseudo amino acid composition by which a considerable amount of sequence-order effects can be incorporated into a set of discrete numbers (Chou, K. C., Proteins: Structure, Function, and Genetics, 2001, 43: 246-255), the complexity measure approach is introduced. The advantage by incorporating the complexity measure factor as one of the pseudo amino acid components for a protein is that it can more effectively reflect its overall sequence-order feature than the conventional correlation factors. With such a formulation frame to represent the samples of protein sequences, the covariant-discriminant predictor (Chou, K. C. and Elrod, D. W., Protein Engineering, 1999, 12: 107-118) was adopted to conduct prediction. High success rates were obtained by both the jackknife cross-validation test and independent dataset test, suggesting that introduction of the concept of the complexity measure into prediction of protein subcellular location is quite promising, and might also hold a great potential as a useful vehicle for the other areas of molecular biology.
Article
With the avalanche of new protein sequences we are facing in the post-genomic era, it is vitally important to develop an automated method for quickly and accurately determining the subcellular location of uncharacterized proteins. In this article, based on the concept of pseudo amino acid composition (Chou, K.C. Proteins: Structure, Function, and Genetics, 2001, 43: 246-255), three pseudo amino acid components are introduced via the Lyapunov index, the Bessel function, and the Chebyshev filter. These components deal more efficiently with the chaos and complexity in protein sequences, leading to a higher success rate in predicting protein subcellular location.
Article
Classification of objects is an important area of research and application in a variety of fields. In the presence of full knowledge of the underlying probabilities, Bayes decision theory gives optimal error rates. In those cases where this information is not present, many algorithms make use of distance or similarity among samples as a means of classification. The K-nearest neighbor decision rule has often been used in these pattern recognition problems. One of the difficulties that arises when utilizing this technique is that each of the labeled samples is given equal importance in deciding the class memberships of the pattern to be classified, regardless of their 'typicalness'. The theory of fuzzy sets is introduced into the K-nearest neighbor technique to develop a fuzzy version of the algorithm. Three methods of assigning fuzzy memberships to the labeled samples are proposed, and experimental results and comparisons to the crisp version are presented.
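The fuzzy K-nearest neighbor rule can be sketched as below. This is a minimal illustration under stated assumptions: training samples carry crisp labels (the simplest of the possible membership assignments), distances are Euclidean, and the function name, the fuzzifier default `m=2.0` and the small distance floor are hypothetical choices, not taken from the abstract.

```python
import math


def fuzzy_knn(train, query, k=3, m=2.0):
    """Return class memberships of `query` from its k nearest labeled samples.

    `train` is a list of (feature_vector, label) pairs; `m` > 1 is the fuzzifier.
    """
    if m <= 1.0:
        raise ValueError("the fuzzifier m must be greater than 1")
    neighbours = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    weights = {}
    total = 0.0
    for features, label in neighbours:
        # Inverse-distance weight; the floor avoids division by zero on exact hits.
        distance = max(math.dist(features, query), 1e-12)
        w = 1.0 / distance ** (2.0 / (m - 1.0))
        weights[label] = weights.get(label, 0.0) + w
        total += w
    # Normalizing makes the memberships sum to one.
    return {label: w / total for label, w in weights.items()}
```

Unlike the crisp rule, the result is a membership value per class rather than a single vote count, so a query close to its neighbours of one class receives a membership near one for that class.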
Article
A simple model is developed for calculation of the difference in free energy (ΔF) between the native and unfolded forms of a protein molecule in solution. A major term in the expression for ΔF arises from the increase in entropy which accompanies unfolding. This term is negative, i.e., it favors the unfolded form. In water, therefore, where a compact globular conformation is stable, local interactions must exist which make a large positive contribution to ΔF. One such interaction is the hydrophobic interaction, which results from the unfavorable arrangement of water molecules which takes place whenever there is contact between water and a non-polar portion of a protein molecule. There are many such contacts when the protein molecule is unfolded, but relatively few in the native state, so that a positive contribution to ΔF results. When amino acids with non-polar side chains are dissolved in water, the same interactions must occur. The magnitude of these interactions can then be estimated from relative solubilities of appropriate amino acids in water and other solvents. Such estimates are made in this paper, and the conclusion is that these hydrophobic interactions alone may be able to account for the instability of an unfolded protein, relative to a suitable globular conformation, in aqueous solution. The model used cannot predict the structure which will be adopted by a given protein molecule in its native state. General considerations suggest, however, that the hydrophobic interactions are compatible with a large variety of structures and that specificity of structure is at least partly due to hydrogen bonds between peptide groups (as well as other polar groups) trapped within the hydrophobic interior.
Article
Membrane proteins are classified according to two different schemes. In scheme 1, they are discriminated among the following five types: (1) type I single-pass transmembrane, (2) type II single-pass transmembrane, (3) multipass transmembrane, (4) lipid chain-anchored membrane, and (5) GPI-anchored membrane proteins. In scheme 2, they are discriminated among the following nine locations: (1) chloroplast, (2) endoplasmic reticulum, (3) Golgi apparatus, (4) lysosome, (5) mitochondria, (6) nucleus, (7) peroxisome, (8) plasma, and (9) vacuole. An algorithm is formulated for predicting the type or location of a given membrane protein based on its amino acid composition. The overall rates of correct prediction thus obtained by both self-consistency and jackknife tests, as well as by an independent dataset test, were around 76–81% for the classification of five types, and 66–70% for the classification of nine cellular locations. Furthermore, classification and prediction were also conducted between inner and outer membrane proteins; the corresponding rates thus obtained were 88–91%. These results imply that the types of membrane proteins, as well as their cellular locations and other attributes, are closely correlated with their amino acid composition. It is anticipated that the classification schemes and prediction algorithm can expedite the functionality determination of new proteins. The concept and method can be also useful in the prioritization of genes and proteins identified by genomics efforts as potential molecular targets for drug design. Proteins 1999;34:137–153. © 1999 Wiley-Liss, Inc.
Article
A prediction algorithm based on physical characteristics of the twenty amino acids and refined by comparison to the proposed bacteriorhodopsin structure was devised to delineate likely membrane-buried regions in the primary sequences of proteins known to interact with the lipid bilayer. Application of the method to the sequence of the carboxyl terminal one-third of bovine rhodopsin predicted a membrane-buried helical hairpin structure. With the use of lipid-buried segments in bacteriorhodopsin as well as regions predicted by the algorithm in other membrane-bound proteins, a hierarchical ranking of the twenty amino acids in their preferences to be in lipid contact was calculated. A helical wheel analysis of the predicted regions suggests which helical faces are within the protein interior and which are in contact with the lipid bilayer.
Article
Identification of nuclear protein localization is significant because it can provide in-depth insight into genome regulation and the functional annotation of novel proteins. A multiclass SVM classifier with various input features was employed for nuclear protein compartment identification. The input features include factor solution scores and evolutionary information (position-specific scoring matrix (PSSM) scores) in addition to conventional dipeptide composition and pseudo amino acid composition. All the SVM classifiers with different sets of input features performed better than the previously available prediction classifiers. The jackknife success rate thus obtained on the benchmark dataset constructed by Shen and Chou [Shen, H.B., Chou, K.C., 2005, Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition. Biochem. Biophys. Res. Commun. 337, 752–756] is 71.23%, indicating that the novel pseudo amino acid composition approach with PSSM and SVM classifier is very promising and may at least play a complementary role to the existing methods.
Article
We have analysed the side-chain dihedral angles in 2536 residues from 19 protein structures. The distributions of χ1 and χ2 are compared with predictions made on the basis of simple energy calculations. The χ1 distribution is trimodal: the g− position of the side-chain (trans to Hα), which is rare except in serine, the t position (trans to the amino group), and the g+ position (trans to the carbonyl group), which is preferred in all residues. Characteristic χ2 distributions are observed for residues with a tetrahedral γ-carbon, for aromatic residues, and for aspartic acid/asparagine. The number of configurations actually observed is small for all types of side-chains, with 60% or more of them in only one or two configurations. We give estimates of the experimental errors on χ1 and χ2 (3° to 16°, depending on the type of the residue), and show that the dihedral angles remain within 15° to 18° (standard deviation) from the configurations with the lowest calculated energies. The distribution of the side-chains among the permitted configurations varies slightly with the conformation of the main chain, and with the position of the residue relative to the protein surface. Configurations that are rare for exposed residues are even rarer for buried residues, suggesting that, while the folded structure puts little strain on side-chain conformations, the side-chain positions with the lowest energy in the unfolded structure are chosen preferentially during folding.
Article
Although most proteins have a single subcellular location, some may simultaneously exist at two or more different subcellular locations. Multiplex proteins as such are particularly interesting because they may have some special functions. To deal with this kind of complicated situation, a novel predictor called “Euk-mPLoc” was developed that can be used to predict the subcellular locations of eukaryotic proteins among the 22 sites as shown in the accompanying figure. The predictor is accessible to the public as a free server at http://202.120.37.186/bioinf/euk-multi.
Article
Measurements of the surface area accessible to solvent provide a convenient definition of the surface and the inside volumes in proteins of known X-ray structure. The study of the accessibility to solvent of amino acid residues in several proteins [1-3] has confirmed the early observation that polar residues are found mostly on the surface and non-polar residues mostly inside globular protein structures. But the accessibility shows systematic variations with the molecular weight, because of the change in surface to volume ratio. Experimental data [4] indicate that the accessible surface area A (in Å²) of monomeric globular proteins follows a law of the form A ∝ M^(2/3) [5,6], which implies that the mean accessible surface area per residue decreases like M^(-1/3) (where M is molecular weight) with increasing M, from about 68 Å² for proteins of 6,000 molecular weight to 38 Å² for proteins of 35,000 molecular weight. Thus, the accessibility to solvent is not a characteristic of the amino acids.
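The scaling quoted above can be checked with a short worked calculation. This is a sketch under stated assumptions: the area law is taken to have the form A = c M^(2/3), which is what the stated M^(-1/3) dependence requires when the residue count n is proportional to M; the constant c is left symbolic because its value is not given here.

```latex
\begin{align*}
A &= c\,M^{2/3}, \qquad n \propto M
  \;\Longrightarrow\; \frac{A}{n} \propto M^{-1/3} \\[4pt]
\frac{(A/n)_{M=35{,}000}}{(A/n)_{M=6{,}000}}
  &= \left(\frac{35{,}000}{6{,}000}\right)^{-1/3}
  \approx \frac{1}{1.80} \approx 0.556 \\[4pt]
0.556 \times 68\ \text{\AA}^2 &\approx 38\ \text{\AA}^2
\end{align*}
```

The ratio reproduces the quoted drop from about 68 Å² to about 38 Å² per residue, so the two figures are consistent with the M^(-1/3) dependence.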
Article
Three different but related comprehensive statistical analyses of amino acid sequences in proteins are described. The goal in each case is to search for evidence of significant sequence structure in individual proteins relative to a purely random arrangement of the amino acid residues and to attempt to relate any significant structure uncovered to the secondary and/or tertiary configuration of the protein. In the first of these analyses, which is reviewed briefly in an appendix, amino acids are divided into subgroups according to a variety of side chain physical properties (e.g. polarity, hydrophobicity). Deviations from randomness are expressed in terms of correlation indices ϱij(c) which are composition normalized doublet frequencies. Here i and j denote membership in a particular group for the physical property chosen and c denotes the “lag”, that is the number of residues along the chain separating the doublet. The other more refined analyses are described in some detail. For both of these each amino acid in a given protein is replaced by its appropriate value on a continuous physical property scale. Six such scales are employed: bulkiness, polarity, RF, pI, pK1 and hydrophobicity. The resulting amino acid index sequences are treated as discrete series and are analyzed first by means of serial correlation methods and subsequently by employing spectral analysis techniques. Periodicities exhibited in these series are evaluated statistically and speculations are made concerning the connection between such structure and protein configuration. Although more than forty individual proteins whose primary sequences are known have been analyzed by these methods, results for the cytochrome c series, the hemoglobins and lysozyme are emphasized in the present paper. In the case of the cytochrome c family of proteins several relationships between primary sequence structure and “evolutionary order” are discussed. In addition, the results of several homogeneity studies are described in which the sequence structure of various portions of a given protein chain are compared.
Article
Sequences of intracellular and extracellular soluble proteins were analyzed statistically in terms of amino acid composition and residue-pair frequencies. Residue-pair frequencies were calculated for sequential separations from (n, n + 1) to (n, n + 5), and converted into scoring parameters. Then, for each test protein, the single-residue and residue-pair parameters were applied to calculate a total score. According to our definition, a protein which yields a positive score is indicative of an intracellular protein, whereas a negative score implies an extracellular one. The parameter set was derived from 894 sequences constituting different protein families in the PIR database, and assessed by application to a test of 379 proteins. The results showed that 88% of intracellular and 84% of extracellular proteins were correctly assigned. The discrimination power was improved by about 8% in comparison with the previous study, which used composition data alone. Segregation of intra/extracellular proteins is also observed by other criteria, such as structural class (intracellular proteins prefer alpha and alpha/beta types and extracellular proteins prefer beta and alpha + beta types). Segregation by sequence was found to be a more reliable procedure for distinguishing intra/extracellular proteins than methods using structural class. Possible causes for this segregation by sequence are discussed.
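A rough sketch of the residue-pair scoring described above follows. The abstract does not say how frequencies are converted into scoring parameters, so the log-ratio with a pseudocount used here is an assumption, and the single-residue term is left out for brevity. All function names are hypothetical.

```python
from collections import Counter
from math import log


def pair_counts(seqs, max_sep=5):
    """Count residue pairs (n, n + d) for every separation d = 1..max_sep."""
    counts = Counter()
    for s in seqs:
        for d in range(1, max_sep + 1):
            for i in range(len(s) - d):
                counts[(d, s[i], s[i + d])] += 1
    return counts


def log_odds_params(intra, extra, max_sep=5, pseudo=1.0):
    """Assumed scoring parameters: log ratio of smoothed pair frequencies."""
    ci, ce = pair_counts(intra, max_sep), pair_counts(extra, max_sep)
    keys = set(ci) | set(ce)
    ti = sum(ci.values()) + pseudo * len(keys)
    te = sum(ce.values()) + pseudo * len(keys)
    return {k: log(((ci[k] + pseudo) / ti) / ((ce[k] + pseudo) / te)) for k in keys}


def score(seq, params, max_sep=5):
    """Total score; positive suggests intracellular, negative extracellular."""
    total = 0.0
    for d in range(1, max_sep + 1):
        for i in range(len(seq) - d):
            total += params.get((d, seq[i], seq[i + d]), 0.0)
    return total


# Toy training sets, invented purely to exercise the functions.
params = log_odds_params(["AAAAAA", "AAKAAA"], ["GGGGGG", "GGSGGG"])
```

Pairs never seen in either training set contribute zero, which keeps the score defined for arbitrary test sequences.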
Article
In multicellular organisms, mutations in somatic cells affecting critical genes that regulate cell proliferation and survival cause fatal cancers. Repair of the damage is one obvious option, although the relative inconsequence of individual cells in metazoans means that it is often a “safer” strategy to ablate the offending cell. Not surprisingly, corruption of the machinery that senses or implements DNA damage greatly predisposes to cancer. Nonetheless, even when oncogenic mutations do occur, there exist potent mechanisms that limit the expansion of affected cells by suppressing their proliferation or triggering their suicide. Growing understanding of these innate mechanisms is suggesting novel therapeutic strategies for cancer.
Article
The caspases represent a family of sulfhydryl proteases that play important regulatory roles in the cell. The tertiary structure of the protease domain of caspase-8, also called FLICE, has been predicted by a segment match modeling procedure. First, the atomic coordinates of the catalytic domain of caspase-3, also called CPP32, a member of the family that is closely related to caspase-8, were determined based upon the crystal structure of human caspase-1 (interleukin converting enzyme). Then, the caspase-3 structure was used as a template for modeling the protease domain of caspase-8. The resulting structure shows the expected level of similarity with the conformations of caspases-1 and -3 for which crystal structures have been determined. Moreover, the subsite contacts between caspase-8 and the covalently linked inhibitor, Ac-DEVD-aldehyde, are only slightly different from those seen in the caspase-3 enzyme/inhibitor complex. The model of caspase-8 can serve as a reference for subsite analysis relative to design of enzyme inhibitors that may find therapeutic application.
Article
Apoptosis requires recruitment of caspases by receptor-associated adaptors through homophilic interactions between the CARDs (caspase recruitment domains) of adaptor proteins and prodomains of caspases. We have solved the CARD structure of the RAIDD adaptor protein that recruits ICH-1/caspase-2. It consists of six tightly packed helices arranged in a topology homologous to the Fas death domain. The surface contains a basic and an acidic patch on opposite sides. This polarity is conserved in the ICH-1 CARD as indicated by homology modeling. Mutagenesis data suggest that these patches mediate CARD/CARD interaction between RAIDD and ICH-1. Subsequent modeling of the CARDs of Apaf-1 and caspase-9, as well as Ced-4 and Ced-3, showed that the basic/acidic surface polarity is highly conserved, suggesting a general mode for CARD/CARD interaction.
Article
Bcl-2 and related cytoplasmic proteins are key regulators of apoptosis, the cell suicide program critical for development, tissue homeostasis, and protection against pathogens. Those most similar to Bcl-2 promote cell survival by inhibiting adapters needed for activation of the proteases (caspases) that dismantle the cell. More distant relatives instead promote apoptosis, apparently through mechanisms that include displacing the adapters from the pro-survival proteins. Thus, for many but not all apoptotic signals, the balance between these competing activities determines cell fate. Bcl-2 family members are essential for maintenance of major organ systems, and mutations affecting them are implicated in cancer.
Article
AAindex is a database of numerical indices representing various physicochemical and biochemical properties of amino acids and pairs of amino acids. It consists of two sections: AAindex1 for the amino acid index of 20 numerical values and AAindex2 for the amino acid mutation matrix of 210 numerical values. Each entry of either AAindex1 or AAindex2 consists of the definition, the reference information, a list of related entries in terms of the correlation coefficient, and the actual data. The database may be accessed through the DBGET/LinkDB system at GenomeNet (http://www.genome.ad.jp/dbget/) or may be downloaded by anonymous FTP (ftp://ftp.genome.ad.jp/db/genomenet/aaindex/).
Article
Membrane proteins are classified according to two different schemes. In scheme 1, they are discriminated among the following five types: (1) type I single-pass transmembrane, (2) type II single-pass transmembrane, (3) multipass transmembrane, (4) lipid chain-anchored membrane, and (5) GPI-anchored membrane proteins. In scheme 2, they are discriminated among the following nine locations: (1) chloroplast, (2) endoplasmic reticulum, (3) Golgi apparatus, (4) lysosome, (5) mitochondria, (6) nucleus, (7) peroxisome, (8) plasma, and (9) vacuole. An algorithm is formulated for predicting the type or location of a given membrane protein based on its amino acid composition. The overall rates of correct prediction thus obtained by both self-consistency and jackknife tests, as well as by an independent dataset test, were around 76-81% for the classification of five types, and 66-70% for the classification of nine cellular locations. Furthermore, classification and prediction were also conducted between inner and outer membrane proteins; the corresponding rates thus obtained were 88-91%. These results imply that the types of membrane proteins, as well as their cellular locations and other attributes, are closely correlated with their amino acid composition. It is anticipated that the classification schemes and prediction algorithm can expedite the functionality determination of new proteins. The concept and method can be also useful in the prioritization of genes and proteins identified by genomics efforts as potential molecular targets for drug design.
Article
The biochemical basis for most of the morphological changes associated with apoptosis can be traced directly or indirectly to the actions of caspases, a family of intracellular cysteine proteases that function as effectors of programmed cell death (1, 2). Much of the recent progress toward mapping pathways for caspase activation has come from evaluations of normal dividing cells or established tumor lines, where obtaining large numbers of cells for biochemical analysis or transferring genes for functional analysis is readily possible. But, do all types of animal cells contain the same wiring instructions when it comes to connecting steps in cell suicide pathways? Researchers studying cell death in the heart are beginning to probe this question, and they are finding some surprises.
Article
AAindex is a database of numerical indices representing various physicochemical and biochemical properties of amino acids and pairs of amino acids. It consists of two sections: AAindex1 for the amino acid index of 20 numerical values and AAindex2 for the amino acid mutation matrix of 210 numerical values. Each entry of either AAindex1 or AAindex2 consists of the definition, the reference information, a list of related entries in terms of the correlation coefficient, and the actual data. The database may be accessed through the DBGET/LinkDB system at GenomeNet (http://www.genome.ad.jp/dbget/) or may be downloaded by anonymous FTP (ftp://ftp.genome.ad.jp/db/genomenet/aaindex/).
Article
Apoptosis, or programmed cell death, plays a central role in the development and homeostasis of an organism. The breakdown of cellular proteins in apoptosis is mediated by caspases, which comprise a highly conserved family of cysteine proteases with specificity for aspartic acid residues at the P1 positions of their substrates. Multiple lines of evidence show that caspase-9 is critical for an apoptosis pathway mediated via the mitochondria. In this study, the three-dimensional structure of the catalytic domain of caspase-9 and its interaction with the inhibitor acetyl-Asp-Val-Ala-Asp fluoromethyl ketone (Ac-DVAD-fmk) have been predicted by a segment matching modeling procedure. As expected, the predicted caspase-9 structure shows both a high similarity in the overall folding topology and remarkable differences in the surface loop regions as compared to other caspase family members such as caspase-1, -3 and -8, for which crystal structures have been determined. This kind of comparative analysis reflects the convergence-divergence duality among the caspases. Moreover, some subtle differences have been observed between caspase-9 and caspase-3 in the subsite contacts with the covalently linked inhibitor Ac-DVAD-fmk. Based on the X-ray structural analysis of caspase-8, a main chain carbonyl oxygen appears to be involved in a catalytic triad with the active site Cys and His residues. The corresponding carbonyl oxygen in caspase-9, together with other expected features of the catalytic apparatus, appears in our model. The predicted structure of caspase-9 can serve as a reference for subsite analysis relative to rational design of highly selective caspase inhibitors for therapeutic application.
Article
The cellular attributes of a protein, such as which compartment of a cell it belongs to and how it is associated with the lipid bilayer of an organelle, are closely correlated with its biological functions. The success of human genome project and the rapid increase in the number of protein sequences entering into data bank have stimulated a challenging frontier: How to develop a fast and accurate method to predict the cellular attributes of a protein based on its amino acid sequence? The existing algorithms for predicting these attributes were all based on the amino acid composition in which no sequence order effect was taken into account. To improve the prediction quality, it is necessary to incorporate such an effect. However, the number of possible patterns for protein sequences is extremely large, which has posed a formidable difficulty for realizing this goal. To deal with such a difficulty, the pseudo-amino acid composition is introduced. It is a combination of a set of discrete sequence correlation factors and the 20 components of the conventional amino acid composition. A remarkable improvement in prediction quality has been observed by using the pseudo-amino acid composition. The success rates of prediction thus obtained are so far the highest for the same classification schemes and same data sets. It has not escaped from our notice that the concept of pseudo-amino acid composition as well as its mathematical framework and biochemical implication may also have a notable impact on improving the prediction quality of other protein features.
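As a minimal sketch of the construction described above, the code below appends a small set of sequence-order correlation factors to the 20 conventional composition components and normalizes the result. It assumes a single property scale (Kyte-Doolittle hydropathy, standardized to zero mean and unit variance) in place of the several physicochemical properties typically combined in practice; the parameter names `lam` and `w` and the default values are illustrative choices, not values taken from the abstract.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy, used as the single property scale (assumption).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(scale):
    """Rescale property values to zero mean and unit variance over 20 residues."""
    vals = [scale[a] for a in AMINO_ACIDS]
    m = sum(vals) / len(vals)
    sd = (sum((v - m) ** 2 for v in vals) / len(vals)) ** 0.5
    return {a: (scale[a] - m) / sd for a in AMINO_ACIDS}


def pse_aac(seq, scale=KD, lam=3, w=0.05):
    """Return a (20 + lam)-dimensional pseudo amino acid composition vector."""
    n = len(seq)
    if lam >= n:
        raise ValueError("lam must be smaller than the sequence length")
    h = standardize(scale)
    # Conventional amino acid composition (20 components).
    freq = [seq.count(a) / n for a in AMINO_ACIDS]
    # k-th tier correlation factor: mean squared property difference at lag k.
    theta = [
        sum((h[seq[i]] - h[seq[i + k]]) ** 2 for i in range(n - k)) / (n - k)
        for k in range(1, lam + 1)
    ]
    denom = sum(freq) + w * sum(theta)
    return [f / denom for f in freq] + [w * t / denom for t in theta]


# Hypothetical peptide, invented for illustration.
vec = pse_aac("MKVLAAGIVGLLLAQ", lam=3, w=0.05)
```

The weight `w` controls how much the sequence-order terms contribute relative to the plain composition; the components always sum to one, so vectors from proteins of different lengths remain comparable.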
Article
The structural class and subcellular location are the two important features of proteins that are closely related to their biological functions. With the rapid increase in new protein sequences entering into data banks, it is highly desirable to develop a fast and accurate method for predicting the attributes of these features for them. This can expedite the functionality determination of new proteins and the process of prioritizing genes and proteins identified by genomics efforts as potential molecular targets for drug design. Various prediction methods have been developed during the last two decades. This review is devoted to presenting a systematic introduction and comparison of the existing methods with respect to the prediction algorithm and classification scheme. The attention is focused on the state of the art, which is represented by the covariant-discriminant algorithm developed very recently, as well as some new classification schemes for protein structural classes and subcellular locations. Also addressed are the physical chemistry foundation of the existing prediction methods and the reason why the covariant-discriminant algorithm is so powerful.
Article
In this paper, based on an approach combining the "functional domain composition" [K.C. Chou, Y.D. Cai, J. Biol. Chem. 277 (2002) 45765] and the pseudo-amino acid composition [K.C. Chou, Proteins Struct. Funct. Genet. 43 (2001) 246; Correction Proteins Struct. Funct. Genet. 44 (2001) 60], the Nearest Neighbour Algorithm (NNA) was developed for predicting the protein subcellular location. Very high success rates were observed, suggesting that such a hybrid approach may become a useful high-throughput tool in the area of bioinformatics and proteomics.
Article
Motivation: The subcellular location of a protein is closely correlated to its function. Thus, computational prediction of subcellular locations from the amino acid sequence information would help annotation and functional prediction of protein coding genes in complete genomes. We have developed a method based on support vector machines (SVMs). Results: We considered 12 subcellular locations in eukaryotic cells: chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracellular medium, Golgi apparatus, lysosome, mitochondrion, nucleus, peroxisome, plasma membrane, and vacuole. We constructed a data set of proteins with known locations from the SWISS-PROT database. A set of SVMs was trained to predict the subcellular location of a given protein based on its amino acid, amino acid pair, and gapped amino acid pair compositions. The predictors based on these different compositions were then combined using a voting scheme. Results obtained through 5-fold cross-validation tests showed an improvement in prediction accuracy over the algorithm based on the amino acid composition only. This prediction method is available via the Internet.
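Two pieces of the scheme above lend themselves to a short sketch: the gapped amino acid pair composition used as an input feature, and the vote that combines per-feature predictors. The SVM training itself is omitted; the functions below are hypothetical helpers, and the tie-breaking rule in the vote is an assumption, since the abstract does not describe one.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def gapped_pair_composition(seq, gap=0):
    """400-dimensional frequency vector of residue pairs separated by `gap`.

    gap=0 gives the ordinary (adjacent) amino acid pair composition.
    """
    step = gap + 1
    total = len(seq) - step
    if total <= 0:
        raise ValueError("sequence too short for this gap")
    counts = Counter(seq[i] + seq[i + step] for i in range(total))
    return [counts[p] / total for p in PAIRS]


def majority_vote(labels):
    """Combine per-predictor labels; ties go to the label seen first (assumption)."""
    return Counter(labels).most_common(1)[0][0]


features = gapped_pair_composition("ACAC", gap=1)
location = majority_vote(["nucleus", "cytoplasm", "nucleus"])
```

In a full pipeline each feature type would feed its own trained classifier, and `majority_vote` would receive one predicted location per classifier.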
Article
This paper reviews the problem of predicting the subcellular location of a protein. Five approaches to extracting information from the global sequence, all based on the Bayes discriminant algorithm, are covered: (1) the auto-correlation functions of amino acid indices along the sequence; (2) the quasi-sequence-order approach; (3) the pseudo-amino acid composition; (4) the unified attribute vector in Hilbert space; and (5) Zp parameters extracted from the Zp curve. The predictive accuracy actually achieved is closely related to the degree of similarity between the training and testing sets or, in a cross-validated study, to the average degree of pairwise similarity in the dataset. Many researchers hold that the high predictive accuracies currently reported cannot guarantee that the available algorithms are effective in practical prediction, because of the high pairwise sequence identity within the datasets. Others argue that datasets used for developing software should reflect the natural reality that some subcellular locations contain only a small number of proteins, some of which share a high percentage of sequence similarity. Owing to the complexity of the problem, sophisticated, specialized programs are needed both for constructing datasets and for improving prediction. In any case, finding the target information in the mature protein sequence and properly combining it with sorting signals may further improve the overall predictive accuracy and bring the prediction into practical use.
Article
The localization of a protein in a cell is closely correlated with its biological function. With the explosion of protein sequences entering into databanks, it is highly desirable to develop an automated method that can rapidly identify their subcellular location. This will expedite the annotation process, providing timely useful information for both basic research and industrial application. In view of this, a powerful predictor has been developed by hybridizing the gene ontology approach [Nat. Genet. 25 (2000) 25], functional domain composition approach [J. Biol. Chem. 277 (2002) 45765], and the pseudo-amino acid composition approach [Proteins Struct. Funct. Genet. 43 (2001) 246; Erratum: ibid. 44 (2001) 60]. As a showcase, the recently constructed dataset [Bioinformatics 19 (2003) 1656] was used for demonstration. The dataset contains 7589 proteins classified into 12 subcellular locations: chloroplast, cytoplasmic, cytoskeleton, endoplasmic reticulum, extracellular, Golgi apparatus, lysosomal, mitochondrial, nuclear, peroxisomal, plasma membrane, and vacuolar. The overall success rate of prediction obtained by the jackknife cross-validation was 92%. This is so far the highest success rate achieved on this dataset by following an objective and rigorous cross-validation procedure.
Article
Apoptosis proteins have a central role in the development and homeostasis of an organism. These proteins are very important for understanding the mechanism of programmed cell death, and their function is related to their types. According to the classification scheme by Zhou and Doctor (2003), the apoptosis proteins are categorized into the following four types: (1) cytoplasmic protein; (2) plasma membrane-bound protein; (3) mitochondrial inner and outer proteins; (4) other proteins. A powerful learning machine, the Support Vector Machine, is applied for predicting the type of a given apoptosis protein by incorporating the sqrt-amino acid composition effect. High success rates were obtained by the re-substitution test (98/98 = 100%) and the jackknife test (89/98 = 90.8%).
Article
To understand the structure and function of a protein, an important task is to know where it occurs in the cell. Thus, a computational method for properly predicting the subcellular location of proteins would be significant in interpreting the original data produced by the large-scale genome sequencing projects. The present work tries to explore an effective method for extracting features from protein primary sequence and find a novel measurement of similarity among proteins for classifying a protein to its proper subcellular location. We considered four locations in eukaryotic cells and three locations in prokaryotic cells, which have been investigated by several groups in the past. A combined feature of primary sequence defined as a 430D (dimensional) vector was utilized to represent a protein, including 20 amino acid compositions, 400 dipeptide compositions and 10 physicochemical properties. To evaluate the prediction performance of this encoding scheme, a jackknife test based on nearest neighbor algorithm was employed. The prediction accuracies for cytoplasmic, extracellular, mitochondrial, and nuclear proteins in the former dataset were 86.3%, 89.2%, 73.5% and 89.4%, respectively, and the total prediction accuracy reached 86.3%. For cytoplasmic, extracellular, and periplasmic proteins in the latter dataset, the prediction accuracies were 97.4%, 86.0%, and 79.7%, respectively, and a total prediction accuracy of 92.5% was achieved. The results indicate that this method outperforms some existing approaches based on amino acid composition or amino acid composition and dipeptide composition.
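The jackknife test with a nearest neighbor classifier mentioned above can be sketched as follows: each protein in turn is held out, its nearest neighbor among the remaining proteins is found, and the neighbor's label is taken as the prediction. Euclidean distance and the plain 20-component composition are assumptions made for brevity; the abstract describes a 430-dimensional vector and its own similarity measure, neither of which is reproduced here.

```python
from math import dist

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """20-dimensional amino acid composition (a simplified stand-in feature)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def jackknife_accuracy(vectors, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    correct = 0
    for i, query in enumerate(vectors):
        # Nearest neighbor among all samples except the held-out one.
        nearest = min(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: dist(query, vectors[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(vectors)


# Toy sequences and labels, invented purely to exercise the functions.
seqs = ["AAAAAG", "AAAAGG", "KKKKKE", "KKKKEE"]
labels = ["cytoplasmic", "cytoplasmic", "nuclear", "nuclear"]
accuracy = jackknife_accuracy([composition(s) for s in seqs], labels)
```

Because every sample is tested exactly once against a model that never saw it, the jackknife gives a single, reproducible accuracy figure for a given dataset, which is why the abstracts on this page report it alongside or instead of k-fold results.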